Introduction to the Special Issue on Advanced Learning Technologies


Vincent Aleven
Carnegie Mellon University


Carole R. Beal
University of Arizona


Arthur C. Graesser
University of Memphis


The 20 articles in this special issue represent a cross-section of interesting, cutting-edge research in 
advanced learning technologies (ALTs). These advanced technologies are increasingly being used in 
educational practice and as convenient platforms for rigorous educational research. Although defining 
ALTs is difficult, ALTs have 3 key elements to varying degrees. First, these technologies are created by 
designers who have a substantial theoretical and empirical understanding of learners, learning, and the 
targeted subject matter. Second, these systems provide a high degree of interactivity, reflecting a view 
of learning as a complex, constructive activity on the part of learners that can be enhanced with detailed, 
adaptive guidance. Third, the system is capable of assessing learners while they use the system along a 
range of psychological dimensions. The emphasis in the special issue is not exclusively on the 
technologies themselves but more fundamentally on the underlying principles of learning, the interactions 
with the learners, and the impact of the technologies on learning gains. Key challenges are how to 
develop and use the technologies in ways that are grounded in theory, science, and sensible practice. 


© 2013 American Psychological Association 
0022-0663/13/$12.00 DOI: 10.1037/a0034155 


Keywords: advanced learning technologies 


This special issue presents a group of articles that were submit- 
ted in response to a call for papers about advanced learning 
technologies (ALTs). The articles represent a cross-section of 
interesting, cutting-edge research in this area. The response to our 
initial call was overwhelming, testifying to the great research 
activity this topic has generated. We received over 80 abstracts. 
We invited 32 full submissions and, eventually, after the Journal 
of Educational Psychology’s stringent peer review process had run 
its course, we ended up with 20 articles to publish. 

Why is a special issue on ALTs both timely and relevant to the 
readership of the Journal of Educational Psychology? There are 
many good reasons. These advanced technologies are increasingly 
being used in educational research and practice, as well as in 
informal learning settings. The advanced technologies are opening 
doors to new learning experiences. Evidence is accumulating that 
these technologies can have a substantial positive impact on stu- 
dents’ learning outcomes, sometimes even on standardized tests. In 
addition, new technologies are affecting teacher training and class- 
room practice. 





This article was published Online First September 9, 2013. 

Vincent Aleven, Human-Computer Interaction Institute, Carnegie Mellon
University; Carole R. Beal, School of Information: Science, Technology,
and Arts, University of Arizona; Arthur C. Graesser, Department of
Psychology, University of Memphis.
This collection of articles illustrates how ALTs are convenient
platforms for scientific research in addition to addressing applied 
research questions in rigorous ways. They allow for systematic, 
consistent, precisely timed administration of instructional interven- 
tions. They allow for the recording of data about learning pro- 
cesses at a grain size that can be quite difficult to achieve with 
other methods, such as hand-coded observations of classroom 
activities or of students working with technologies. They can 
record the frequency and detail of learning interactions over pro- 
longed periods of time, in real educational settings, as illustrated in 
a number of articles in this special issue. As ALTs scale up and 
become widespread, it is increasingly important to build a strong 
scientific basis regarding the effectiveness of different technolo- 
gies. What works well, with what learners, in what contexts? 

The emphasis in the special issue is not exclusively on the 
technologies themselves but more fundamentally on the underly- 
ing principles of learning, the interactions with the learners, and 
the impact of the technologies on learning gains. Simply put, the 
features of the technologies are grounded in psychological theory 
and empirically tested in assessments of learning and motivation. 
The special issue presents a suite of examples of many such 
technologies and associated theory and evidence. This provides a 
snapshot of the state of the art in ALTs and an introduction to 
readers who are not familiar with the field. 

It is appropriate to define what we mean by ALTs. There is no 
authoritative definition of an ALT, but ALTs do have three key 
elements to varying degrees. First, these technologies are created 
by designers who have a substantial theoretical and empirical 
understanding of learners, learning, and the targeted subject mat- 
ter. Second, these systems provide a high degree of interactivity, 
reflecting a view of learning as a complex, constructive activity on
the part of learners that can be enhanced with detailed, adaptive 
guidance. Third, the system is capable of assessing learners while 
they use the system along different psychological dimensions, such 
as mastery of the targeted domain knowledge, application of 
learning strategies, and experiences of affective states. On the 
basis of these assessments, the systems make pedagogical deci- 
sions that attempt to adapt to the needs of individual learners. This 
broad definition encompasses a range of systems and approaches, 
as exemplified in the current special issue. 

It is easy to see how some of the technologies and systems fit 
our definition of an ALT. For example, intelligent tutoring sys- 
tems, pedagogical agents, simulations that provide guidance to 
learners, and the best of educational games (not all of them) fit this 
definition. For other technologies, it is easy to say that they do not 
fit the definition. PowerPoint does not fit the definition very well 
even though it is an advanced tool in many ways. PowerPoint does 
not adapt to individual learners and rarely is set up to be interac- 
tive. SMART Boards are not ALTs in spite of their name. Web 
cameras in the classroom also do not incorporate adaptation or 
automatic assessment of student progress. For some systems, it is 
harder to determine whether or not they should be classified as 
ALTs. What we can say, however, is that for any given system, the 
case is stronger to the extent that it exhibits the three elements 
listed above. Many online courses (e.g., MOOCs, or massive open 
online courses) contain multimedia content such as video lectures, 
but the use of multimedia per se does not make them ALTs. A 
multimedia presentation is not necessarily adaptive and interactive. 
Online courses can be considered more advanced when they, for 
example, (a) contain explanations, prompts, and elaborations that 
are crafted on the basis of theoretical learning principles, a cog- 
nitive task analysis, and/or an analysis of student data; (b) decom- 
pose student contributions with advanced natural language tech- 
nologies, affect sensing components, or other intelligent pattern 
detectors to provide adaptive feedback; and (c) use a sophisticated 
computational algorithm for deciding what prompts to present to 
each student, adapting to system-assessed individual differences. 
Likewise, among the systems that support learners during 
problem-solving activities, some systems are essentially online 
worksheets with answers revealed after the learner has solved each 
problem, whereas others (such as intelligent tutoring systems) can 
provide highly interactive step-by-step guidance while being sen- 
sitive to alternative student strategies and other student variables. 
Moreover, the pedagogical context matters. What is currently 
advanced from the point of view of classroom practice may not be 
advanced from the point of view of researchers. As special issue 
editors, we have tried to embrace a rather broad set of technologies 
that are advanced in at least some of these ways. 

Roughly half of the articles in this special issue investigate 
intelligent tutoring systems or systems that integrate intelligent 
tutoring components with another technology. Examples of the 
latter are games, simulations, and natural language dialogue tech- 
nologies that have an added tutoring component. Intelligent tutor- 
ing systems are defined as systems that provide detailed guidance 
(e.g., through hints, feedback, after-action review, or individual- 
ized problem selection) as learners work through complex problem 
scenarios and hone their understanding and problem-solving skills. 
Of the ALTs featured in the current special issue, intelligent 
tutoring systems are the ones most commonly found in classrooms. 


Nevertheless, the articles in this issue present studies on other 
learning technologies as well. Some articles present systems ca- 
pable of understanding natural language and having a dialogue 
with learners. Others feature systems in which learners interact 
with computer-based pedagogical agents (talking heads) that pro- 
vide guidance. In the case of teachable agents, the learner takes on 
the role of teacher, tapping into the wisdom implicit in the old 
adage that people never learn so well as when they are teaching. 
Several articles present research based on educational games, 
simulations, and virtual learning environments. Finally, the special 
issue includes articles investigating electronic portfolios and the 
use of webcam technology that enables teachers to coach strug- 
gling readers remotely. The latter technology illustrates our point 
made above: Although live communication with live video streams 
over the network is not technologically new, the use of webcams 
for remote coaching by tutors is an innovative use of standard 
technology in a classroom setting. Once again, the emphasis is on 
learning and instruction, not exclusively on technology. 

Much of the research in this special issue is experimental 
research in which innovative technologies are compared against 
challenging comparison conditions that are less technologically 
advanced. In some of the articles, the authors test the incremental 
value and influence of particular technology features on learning. 
Two studies focused on educational data mining, a strong new
trend in research on ALTs. As learning technologies become more
prevalent, so do the information-rich data sets collected with these 
technologies. These data sets can be mined for insight into learning 
and learning processes. We frequently see that educational data 
sets are mined in secondary analyses to answer questions that are 
remarkably different from the ones for which the data were col- 
lected. These secondary analyses are often carried out by research- 
ers not involved in the initial data collection. Some researchers 
have even gone out on a limb by predicting that the main impact 
of technology on education will be through educational data min- 
ing. To some degree, that prediction is borne out in the current set 
of articles, which provide insight into students’ learning processes 
through log data collected with these technologies. These logs are 
written automatically by the computer to record interactions be- 
tween the learners and the computer. Finally, the special issue also 
contains two meta-analyses that validate the use of novel technol- 
ogies for tracking assessment of learning and emotions. 

Most of the research presented in this issue involves classroom 
research or secondary analysis of data collected in classrooms. 
This is interesting in a number of ways. First, results from labo- 
ratory experiments do not always survive the transition to class- 
rooms, so it makes sense to work in classrooms to enhance the 
ecological validity of research studies. Second, the fact that much 
research took place in real educational settings illustrates a level of 
maturity of the technologies that are presented. Simply put, they 
have survived the stringent test of being classroom proof. A class 
of middle school students working on experimental learning soft- 
ware is a critical audience who will “explore around the edges”! 
That is, their curiosity leads them to use a learning system in 
unanticipated ways that are hard to emulate using conventional 
software testing methods. As a consequence, previously undiscovered
flaws in the system are likely to come to light. Third, and
particularly important, many of the reported studies with ALTs 
showed positive learning gains in actual classrooms. 
Regarding targeted domains and subject matters, the vast majority
of the articles examine challenging K-12 subject matters,
namely, reading, writing, math, and science. However, there also 
are articles in which technologies were used to help learners 
develop softer skills such as intercultural competence and adept- 
ness at tutoring. 

In closing, this collection of articles can be viewed as a harbinger
of things to come in the field of educational psychology. No
one doubts that computers will play an increasingly fundamental
role in education. The key challenges are how to develop and use 
the technologies in ways that are grounded in theory, science, and 
sensible practice. The articles in this special issue are 20 examples 
providing a glimpse of how this might be accomplished. 
Using Adaptive Learning Technologies to Personalize Instruction to
Student Interests: The Impact of Relevant Contexts on Performance and
Learning Outcomes


Candace A. Walkington 
Southern Methodist University 


Adaptive learning technologies are emerging in educational settings as a means to customize instruction 
to learners’ background, experiences, and prior knowledge. Here, a technology-based personalization 
intervention within an intelligent tutoring system (ITS) for secondary mathematics was used to adapt 
instruction to students’ personal interests. We conducted a learning experiment where 145 ninth-grade 
Algebra I students were randomly assigned to 2 conditions in the Cognitive Tutor Algebra ITS. For 1 
instructional unit, half of the students received normal algebra story problems, and half received matched 
problems personalized to their out-of-school interests in areas such as sports, music, and movies. Results 
showed that students in the personalization condition solved problems faster and more accurately within 
the modified unit. The impact of personalization was most pronounced for 1 skill in particular (writing
symbolic equations from story scenarios) and for 1 group of students in particular (students who were
struggling to learn within the tutoring environment). Once the treatment had been removed, students who
had received personalization continued to write symbolic equations for normal story problems with 
increasingly complex structures more accurately and with greater efficiency. Thus, we provide evidence 
that interest-based interventions can promote robust learning outcomes—such as transfer and accelerated 
future learning—in secondary mathematics. These interest-based connections may allow for abstract 
ideas to become perceptually grounded in students’ experiences such that they become easier to grasp. 
Adaptive learning technologies that utilize interest may be a powerful way to support learners in gaining
fluency with abstract representational systems.


Keywords: intelligent tutoring system, personalization, individual interest, topic interest, algebra 


The computer is the Proteus of machines. Its essence is its universal- 
ity, its power to simulate. Because it can take on a thousand forms and 
can serve a thousand functions, it can appeal to a thousand tastes. 
(Papert, 1980, p. xxi) 


Advanced learning technologies are emerging in educational 
settings as a powerful means to adapt instruction to learners’ 
backgrounds, goals, preferences, and prior knowledge (Papert, 
1980, 1993). Such technologies have the potential to transform the 
very nature of education by allowing for a level of customization 
that can fundamentally change the relationship between the learner 
and the content to be learned (Collins & Halverson, 2009). In 


This article was published Online First September 9, 2013. 
particular, learning technology innovations allow for instruction to
be personalized to users’ actions and interests, to provide assis- 
tance when needed and present instruction that is understandable, 
engaging, and situated in relevant and meaningful contexts. One 
well-known example of this type of personalized learning is intel- 
ligent tutoring systems (ITSs)—technology environments that uti- 
lize artificial intelligence to adapt instruction to learner knowledge 
states (Koedinger & Corbett, 2006). As Papert (1980) predicted, 
the computer has truly become the “Proteus of machines,” a 
shape-shifter able to adapt to the infinite variations in the back- 
ground and preferences of its users. 

The rise of such adaptive learning technologies is timely, given the 
pressing issues with motivation that face schools today (Hidi & 
Harackiewicz, 2000). One important principle for adaptive learning 
environments is that instruction may be effective when presented in 
the context of learners’ interests—their predispositions to engage 
with particular topics, ideas, or activities (Hidi & Renninger, 
2006). Research shows that presenting instruction in the context of 
learners’ interests is effective, impacting persistence, attention, and
engagement (e.g., Ainley, Hidi, & Berndorff, 2002; Flowerday, 
Schraw, & Stevens, 2004; Hidi, 1990, 2001). However, with the 
exception of a few studies in reading (e.g., Heilman, Collins, 
Eskenazi, Juffs, & Wilson, 2010) and basic mathematics (e.g.,
Anand & Ross, 1987; Cordova & Lepper, 1996), there has been
little research on bringing student interests into adaptive 
technology-based learning environments. 
One domain that may benefit from interest-based interventions is
high school algebra. Algebra has been framed as a gatekeeper to 
higher level mathematics, with significant implications for students’ 
economic futures (Kaput, 2000; Moses & Cobb, 2001). Well-known 
issues with motivation and interest occur during adolescence and 
secondary mathematics courses (Fredricks & Eccles, 2002; Frenzel, 
Goetz, Pekrun, & Watt, 2010; McCoy, 2005; Mitchell, 1993). In 
algebra, students make an important transition from working with 
known quantities to using symbols to represent unknown quantities, 
learning skills like writing, manipulating, and solving algebraic ex- 
pressions (Common Core State Standards Initiative, 2010). Interest 
interventions may be designed to provide a type of perceptual ground- 
ing (Goldstone & Son, 2005) for abstract systems of representation, 
making them more situated and understandable. Grounding is accom- 
plished when abstract ideas are related to concrete objects or events 
that learners are familiar with, like their interests. Goldstone and Son 
(2005) found that initial presentation of scientific principles in con- 
crete, grounded form improves transfer when this concreteness is 
faded over time. 

When considering how concrete, interest-based representations 
of ideas may benefit students, it is important to account for the 
desired learning outcomes. An intervention designed to make 
concepts easier and more approachable will not necessarily en- 
hance student learning. This distinction is captured by research on 
desirable difficulties, which has shown that modifications that 
make a task more difficult during training, like decreasing feed- 
back, can actually enhance learning measures like retention 
(Schmidt & Bjork, 1992). To distinguish between immediate per- 
formance during training and students’ long-term learning, 
Koedinger, Corbett, and Perfetti (2012) identified three types of 
robust learning: learning that is retained over time (long-term 
retention), learning that can be applied in new situations (trans- 
fer), and learning that can form the basis for new concepts 
(accelerated future learning). An important goal of research on 
robust learning is to explore instructional principles associated 
with improved outcomes while also determining how these 
principles are impacted by student-level and knowledge-level 
characteristics in different content areas. 

Interest-based interventions designed to promote grounding have 
the potential to improve robust learning because they not only make 
tasks easier for students to conceptualize but may promote worthwhile 
and robust connections between students’ prior knowledge and ab- 
stract systems of representation. In particular, embedding instruction 
in students’ interests may facilitate connections between students’ 
situation models of the actions, events, and relationships in the math- 
ematical scenario they are confronting and the associated problem 
models containing formal notation (Kintsch, 1986; Nathan, Kintsch, 
& Young, 1992). These connections may allow students to use ab- 
stractions effectively and meaningfully, while they remain relatively 
portable to a variety of situations. 

Here, we seek to contribute to the theoretical and empirical 
bases of interest-based interventions by expanding this research to 
a new, more abstract domain and by examining how a robust 
psychological principle of learning, topic interest, can be inte- 
grated into adaptive technologies to promote learning. We explore 
the instructional principle of context personalization (hereafter,
personalization), which is a type of interest-based intervention
where instructional contexts are matched to students’
out-of-school interests (Anand & Ross, 1987; Cordova & Lepper,
1996). This work takes place within Cognitive Tutor Algebra
(CTA), an ITS that already contains powerful capabilities to adapt 
instruction to student knowledge. We look at how the resources of 
an ITS can be further leveraged to connect instruction to interests 
and discuss the outcomes of this intervention for robust learning. 


Theoretical Framework 


Individual, Situational, and Topic Interest 


Context personalization is an instructional intervention that is 
hypothesized to mediate learning outcomes by eliciting interest 
(Heilman, Collins, Eskenazi, Juffs, & Wilson, 2010; Reber, Het- 
land, Chen, Norman, & Kobbelvedt, 2009; Renninger, Ewen, & 
Lasher, 2002). Hidi and Renninger (2006) defined interest as “the 
psychological state of engaging or the predisposition to reengage 
with particular classes of objects, events, or ideas over time” (p. 
112) and emphasized that it has cognitive and affective components.
Interest unfolds as learners interact with their environment
(Renninger & Hidi, 2011). The personalization intervention here is 
designed to elicit topic interest—interest triggered when learners 
are presented with a specific topic or theme. Topic interest is 
dependent on characteristics of both the learner and the environ- 
ment and has aspects of both individual and situational interest 
(Ainley et al., 2002; Hidi, 2001). 

Individual interest is the relatively stable and enduring prefer- 
ences held by a learner toward specific activities, objects, or 
events. Individual interest is composed of both stored value, the 
learner’s feelings toward the activity, and stored knowledge, the 
learner’s understanding of the structure and discourse of the ac- 
tivity (Renninger et al., 2002). Situational interest, on the other 
hand, is an attention-focusing and affective reaction to particular 
stimuli or characteristics of the environment, such as coherence, 
salience, personal relevance, or incongruity (Hidi & Renninger, 
2006). Situational interest can be triggered as an intervention grabs 
students’ attention and maintained as students engage with the 
material (Linnenbrink-Garcia et al., 2010). Our personalization 
intervention is both an environmental modification intended to 
trigger and maintain situational interest and a way of leveraging 
individual interests in out-of-school topics. Learners may need 
supports like personalization to initially connect to the content, but 
such connections may need to be meaningful to maintain interest. 
The environment and the learner’s goals and characteristics are 
critical to the development of interest (Renninger & Su, 2012). 


Interest and Prior Knowledge 


Instructional modifications that involve individual interest nec- 
essarily involve prior knowledge—learners are likely to have high 
prior knowledge about their interests (Renninger et al., 2002). The 
effect of interventions that leverage both knowledge and value- 
related components of interest may be especially powerful because 
the interest-based triggers are tied explicitly to the content to be 
learned (Mitchell, 1993; Reber et al., 2009). This is the approach 
of the intervention used here, and it is contrasted with interest- 
based triggers that are not relevant to the learning task, such as 
decorative pictures or the insertion of incidental elements like 
student names (e.g., Cordova & Lepper, 1996). This distinction is 
especially important in algebra, considering that meaningfulness of 
the mathematical content to secondary students’ lives is an important
component of interest and that many students find mathematics
courses to be without such meaning (Mitchell, 1993).


Interest and Instructional Outcomes 


Interest can be developed through carefully designed learning 
environments that allow students to connect to the content they are 
learning (Renninger & Hidi, 2011). The activation of interest is 
associated with improved learning (Ainley et al., 2002; Ainley, 
Hillman, & Hidi, 2002; Harackiewicz, Durik, Barron, Linnenbrink,
& Tauer, 2008; Schiefele, 1990, 1991), as well as with
increased attention (McDaniel, Waddill, Finstad, & Bourg, 2000; 
Renninger & Wozniak, 1985), persistence (Ainley et al., 2002),
engagement (Flowerday et al., 2004), reported task involvement 
and perceived competence (Durik & Harackiewicz, 2007), re- 
ported utility value (Hulleman, Godes, Hendricks, & Harackie- 
wicz, 2010), and motivational variables like self-efficacy, self- 
regulation, and achievement goals (Harackiewicz et al., 2008; Hidi 
& Ainley, 2008; Sansone, Fraughton, Zachary, Butner, & Heiner, 
2011). Interest-based interventions can promote academic achieve- 
ment and interest in future courses and careers (Cordova & Lepper, 
1996; Harackiewicz et al., 2012; Hulleman & Harackiewicz, 2009).

To investigate the path from activated interest to increased 
learning, researchers have used response time measures to examine 
attention and persistence. Ainley et al. (2002) found that topic 
interest was associated with students continuing to read a text 
rather than stopping reading, which in turn affected learning out- 
comes. However, other work has found faster response times when 
students have interest in the topic, suggesting that interest facili- 
tates automatic allocation of attention and frees up cognitive 
resources (Hidi, 1990; McDaniel et al., 2000). Such time measures 
have not been well examined in the domain of mathematics, so 
from previous literature, it is unclear whether an interest-based 
intervention would increase problem-solving time by enhancing 
persistence or reduce problem-solving time by facilitating atten- 
tional allocation. Koedinger et al. (2012) discussed measuring 
learning efficiency in interventions—the idea that since instruc- 
tional time is so valuable in classrooms, completing activities in 
less time without reducing learning is an important outcome. They 
observed that “too many theoretical analyses and experimental 
studies do not address the time costs of instructional methods” 
(Koedinger et al., 2012, p. 34). 

Within an ITS, there are additional behavioral
measures that may be important when considering the relationship 
among interest, attention, and persistence. Renninger and Su 
(2012) described how students’ connection to or interest in the 
content can impact how they make use of available supports. ITS 
research has shown that learners can engage in “gaming-the- 
system” behaviors where they take advantage of the tutor’s hints 
and feedback. Learners may quickly click through all of the hints 
available for a problem, until they reach the bottom-out hint giving 
the answer, or they may enter different answers quickly and 
repeatedly, looking for the response the tutor will accept. Gaming 
behaviors have been found to be negatively correlated with learn- 
ing (Baker, Corbett, Koedinger, & Wagner, 2004). Here, an ex- 
amination of both learning efficiency and gaming behaviors is 
provided in order to explore how interest may interact with atten- 
tion and persistence. 
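
The two gaming patterns described above (clicking rapidly through hints, and entering answers quickly and repeatedly) can be illustrated with a minimal sketch over tutor log events. This is an illustration only, not the detector used by Baker et al. (2004) or by Cognitive Tutor Algebra. The `LogEvent` fields, the labels `hint_abuse` and `rapid_guessing`, and the thresholds `FAST_SECONDS` and `MIN_RUN` are all hypothetical assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds chosen only for illustration; they are not taken
# from Baker et al. (2004) or from Cognitive Tutor Algebra.
FAST_SECONDS = 2.0  # an action faster than this counts as "rapid"
MIN_RUN = 3         # this many rapid actions of one kind in a row is flagged


@dataclass
class LogEvent:
    """One hypothetical log row for a single student on one problem step."""
    action: str            # "hint" or "attempt"
    seconds: float         # time elapsed since the previous action
    correct: bool = False  # only meaningful when action == "attempt"


def flag_gaming(events: List[LogEvent]) -> List[str]:
    """Return one label for each run of rapid hints or rapid wrong attempts."""
    flags: List[str] = []
    run_action: Optional[str] = None
    run_length = 0
    for event in events:
        rapid = event.seconds < FAST_SECONDS
        wrong_attempt = event.action == "attempt" and not event.correct
        suspicious = rapid and (event.action == "hint" or wrong_attempt)
        if suspicious and event.action == run_action:
            run_length += 1                      # the current run continues
        elif suspicious:
            run_action, run_length = event.action, 1  # a new run starts
        else:
            run_action, run_length = None, 0     # a deliberate action resets
        if run_length == MIN_RUN:                # flag each run only once
            flags.append("hint_abuse" if run_action == "hint" else "rapid_guessing")
    return flags
```

Under these assumptions, three hint requests made less than 2 s apart would be labeled `hint_abuse`, whereas the same three requests spaced several seconds apart would not be flagged. A validated detector would need to be calibrated against observed student behavior rather than fixed cutoffs.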


Interest and Mathematics Learning 


An interest-based intervention may be especially effective for 
mathematics learning. Koedinger and Nathan (2004) found that 
algebra story problems and verbal word equations were easier for 
students to solve than matched symbolic equations. One explana- 
tion for this phenomenon is that verbal contexts allow for abstract 
mathematical ideas to become grounded in concrete experiences, 
fostering important connections to prior knowledge (Wilensky, 
1991). Kintsch’s (1986) model of text comprehension can be used 
to hypothesize how interest and prior knowledge may impact 
learning from story problems. In this model, learners create a 
textbase of the propositions supplied directly in the text, as well as 
a situation model that integrates the content of the text with prior 
knowledge. Schiefele (1999) reported that a reader’s level of 
interest in a text’s topic is more highly correlated with deep-level 
processing measures (e.g., comprehension) than surface-level pro- 
cessing measures (e.g., recall). Similarly, McDaniel et al. (2000) 
found that interest increased the learner’s focus on organizing the 
text using structural linkages (perhaps to construct situation mod- 
els), instead of focusing on simply extracting the proposition- 
specific content. Thus, activated interest may be associated with 
more meaningful, detailed, and accurate situation models. 

Kintsch (1986) emphasized that when learners are familiar
with the situation being described in a mathematics word prob- 
lem, they are more likely to correctly formulate a solution. 
Nathan et al. (1992) found that when students’ construction of 
situation models was supported through animations of the ac- 
tions and relationships in the story, students were better able to 
write equations from story problems. Activating interest may be 
an important method for supporting learners in successfully 
coordinating situation and problem models, providing a means 
of grounding abstract ideas. In an investigation of children 
solving personalized arithmetic problems, Renninger et al. 
(2002) found that personalized scenarios allowed students to 
form potentially powerful connections between the context of 
the story problem and the underlying mathematics content. 
They discussed how problems with interest-based contexts can 
allow some students to focus on the meaning of the problems 
being posed and away from keywords. Similarly, when inves- 
tigating the insertion of interest-based contexts into reading 
passages, they found that these contexts allowed some students 
to focus on extracting meaning from the text and provided a 
scaffold for decoding and recall. Interest-based contexts may 
have provided a means to ground the deeper structure of these 
tasks in students’ prior knowledge. 

Interest may promote grounding by allowing learners to form 
meaningful connections between their qualitative understanding 
of story scenarios and their attempts to mathematically model 
these situations. This could be contrasted with direct transla- 
tion or keyword-type approaches where learners map directly 
from the propositions in the word problem text to a series of 
computations (Hegarty, Mayer, & Monk, 1995). The strength of 
interest-based interventions in mathematics may be related to 
the support they provide for both the construction of accurate 
and meaningful situation models and the successful coordina- 
tion of situation and problem models during problem solving. 
Literature Review 


Studies of Personalization in Mathematics 


There have been a number of previous studies on personaliza- 
tion in elementary mathematics, which have had mixed results. 
Two studies personalized incidental aspects (e.g., inserting stu- 
dents’ name and favorite food) of computer environments that 
provided instruction on arithmetic and found that students who 
received personalization outperformed control groups on measures 
of learning (Anand & Ross, 1987; Cordova & Lepper, 1996). Two 
other studies found posttest differences for personalization using 
arithmetic story problems (Chen & Liu, 2007; Lopez & Sullivan, 
1992), while an additional study found that personalized arithmetic 
problems were easier for students (Davis-Dorsey, Ross, & Morri- 
son, 1991). However, several studies found no effect for person- 
alization (E. Bates & Wiest, 2004; Cakir & Simsek, 2010; Ku & 
Sullivan, 2000). All of these studies involved elementary school 
content, usually arithmetic problems. The present study is unique 
in that it extends this work into algebra and secondary mathemat- 
ics. As such, we next briefly review the literature on the develop- 
ment of algebraic reasoning. 


The Development of Algebraic Reasoning 


Key to the development of algebraic reasoning is working with 
and using variables to represent unknown quantities. Symboliza- 
tion and symbol manipulation tasks are challenging for students to 
learn (Filloy & Rojano, 1989; Herscovics & Linchevski, 1994; 
Koedinger & Nathan, 2004; Stacey & MacGregor, 1999; Walk- 
ington, Sherman, & Petrosino, 2012). When students write sym- 
bolic equations of functional relationships, they often view these 
equations as a sequence of calculations rather than a statement 
about equality (Breidenbach, Dubinsky, Hawks, & Nichols, 1992; 
Clement, 1982; Humberstone & Reeve, 2008; Stacey & 
MacGregor, 1999). Students erroneously assign shifting or multi- 
ple values to unknowns and may allow one variable to stand for 
two different quantities (Stacey & MacGregor, 1999). As such, 
students have difficulty conceptualizing the idea of operating 
directly on an unknown quantity (Filloy & Rojano, 1989; Hersco- 
vics & Linchevski, 1994; Stacey & MacGregor, 1999). 

The present study focuses on writing symbolic expressions from 
story scenarios, which is a particularly difficult concept (Bardini, 
Pierce, & Stacey, 2004; Heffernan & Koedinger, 1997; Koedinger 
& McLaughlin, 2010; Nathan et al., 1992; Swafford & Langrall, 
2000; Walkington et al., 2012). Although students may be able to 
describe the relationships in a story problem verbally, using stan- 
dard algebraic notation is more challenging (Bardini et al., 2004; 
Swafford & Langrall, 2000). Students must learn to negotiate the 
mathematical “grammar of such expressions” and can have difficulty 
combining or composing different parts of expressions (Koedinger 
& McLaughlin, 2010, p. 471; see also Heffernan & Koedinger, 
1997). Students may not see the utility of writing an 
equation from a story scenario and can have a tendency to solve for 
concrete, specific cases without formulating a general equation 
(Stacey & MacGregor, 1999; Swafford & Langrall, 2000). 
Despite the difficulties learners encounter with algebraic sym- 
bolization, research has shown that using relevant contexts based 
in students’ experiences is an effective method for introducing 
tools of abstraction (Bardini et al., 2004; Carraher, Schliemann, 
Brizuela, & Earnest, 2006; Chazan, 1999; Lampert, 2001; Moses 
& Cobb, 2001). These connections may support students in mak- 
ing the difficult transition to using symbols, especially when 
contexts are linked to students’ interests and experiences through 
personalization. As described previously, personalized contexts 
may activate interest, allowing for greater focus of attention, 
engagement, and persistence in the difficult task of algebraic 
symbolization. Interest-based scenarios may also ground abstract 
ideas in concrete experiences and prior knowledge. Thus, person- 
alization has the potential to support the development of algebraic 
reasoning. 


Personalization in Algebra 


In prior work, we presented 24 ninth-grade Algebra I students 
with a set of algebra story problems that contained both normal 
problems and problems personalized to student interests (Walk- 
ington, Petrosino, & Sherman, in press). Results showed that when 
solving personalized problems, students more often used informal 
strategies that closely mirrored the action of the story (15% of 
responses for normal problems, 42% of responses for personalized 
problems). Students also made fewer conceptual mistakes when 
mathematically formulating the relationships described in a per- 
sonalized story (24% omitted intercept for normal problems, 13% 
omitted intercept for personalized problems). Regression models 
indicated that personalization had a significant and positive impact 
on performance for struggling students who performed poorly on 
their problem set (p < .01) and for problems that incorporated a 
particularly difficult linear function, such as “y = −0.23x + 7.87” 
(p < .05). There were no significant differences related to student 
demographics. For more information on this study’s methodology 
and results, see Walkington et al. (in press). 

Based on this work, we hypothesized that the dependence of the 
effect of personalization on student ability and problem difficulty 
may be the reason why personalization studies have shown mixed 
results. Personalization might be particularly effective in the con- 
text of an ITS, where problems that incorporate different skills are 
selected at the appropriate difficulty level based on a cognitive 
model of student performance. Here, we conducted an experiment 
in the CTA environment where students were randomly assigned 
to receive story problems personalized to their interests or standard 
story problems for one unit. Personalization was accomplished by 
administering an interests survey to both groups and having the 
experimental group receive problem variations corresponding to 
their areas of interest in topics like sports, music, and movies. We 
examined students’ immediate performance on the personalized 
and normal problems, as well as their robust learning from the 
intervention, in terms of accelerated future learning and transfer to 
novel problem formats. We also looked at how receiving person- 
alized problems impacted learning efficiency measures. 


Research Questions 


We pursued three research questions. 

R.1: How does context personalization impact performance 
measures while the intervention is in place? In particular, how 
is this effect moderated by (a) the ability of the learner and (b) 
the difficulty of the task? Based on prior work, we hypothe- 
sized that personalization would enhance performance and that this 
effect should be strongest when students are struggling to learn 
difficult concepts. 

R.2: How does context personalization impact response time 
measures while the intervention is in place? Although re- 
search on the impact of interest on problem-solving duration is 
mixed, we hypothesized that personalization would reduce time 
spent solving problems. Personalization may impact other time 
measures, such as the tendency to game the system by taking 
advantage of ITS feedback. 

R.3: How does context personalization impact measures of 
robust learning once the intervention is removed? Based on 
the literature on interest, grounding, and personalization, we 
hypothesized that personalized contexts would help students 
understand the underlying concepts, allowing for transfer to, and 
accelerated future learning in, later topics related to linear 
functions in the curriculum. 


Method 


Participants 


The high school at which the study took place was in a rural area 
outside a large city in the northeastern United States. The school 
was 96% Caucasian and 51% male/49% female, and 18% of 
students were eligible for free/reduced lunch. There were three 
Algebra I teachers at the school who taught nine different Algebra 
I classes. Algebra I students at the school were usually classified 
as ninth graders; however, a few 10th and 11th graders were 
enrolled in Algebra I as repeaters. The experiment impacted Unit 
6 of the CTA software. The experiment was “in sequence,” mean- 
ing that students reached Unit 6 in CTA at their own pace as they 
worked in the tutoring environment 2 days each week. There were 
195 Algebra I students with CTA accounts, and of these, 145 (73 
control, 72 experimental) were included in the study. The excluded 
students either had completed Unit 6 prior to the study’s start in 
October (n = 36) or did not reach Unit 6 by the end of the school 
year (n = 14). Since curriculum progress in CTA can be considered 
a proxy for student achievement,¹ the data analyses for Unit 
6 omit a group of the top-performing students and a few of the very 
weak students in these classes. The CTA data set that was used for 
this study was deidentified, meaning that it contained no informa- 
tion on student background variables like gender and ethnicity. 
This is a limitation of this type of data; however, we have no 
reason to believe that the students included in the study did not 
generally reflect the demographic characteristics of the school, 
especially given the relative uniformity of the school’s demo- 
graphic makeup. 


Learning Environment 


The study took place within the CTA software environment. 
CTA is an ITS for Algebra I that uses model-tracing approaches to 
individualize problem selection and knowledge-tracing approaches 
to individualize hints and feedback (Koedinger & Aleven, 2007). 
CTA focuses on mathematical functions, and students must nego- 
tiate different representational formats (equations, tables, graphs) 
using computational tools (equation-solving tool, spreadsheet). We 
modified one unit in CTA: Unit 6, Linear Models and Independent 
Variables. This unit contains story problems that involve linear 
functions (see Figure 1) where students write symbolic expressions 
and fill in tables with functional values. There were 27 story scenarios 
in Unit 6, and each could contain several slightly different sets of 
numbers. As is typical for an ITS, the number of problems each 
student received was dependent on the student’s performance. On 
average, each student received 24.39 different problems (SD = 9.46 
problems) and spent a total of 3 hr and 39 min in Unit 6 (SD = 2 hr 
and 25 min) over 6.38 different days (SD = 3.76 days). 

In order to assess robust learning through measures of transfer and 
accelerated future learning, students’ performance in Unit 10 of CTA, 
Linear Models and Four Quadrant Graphs, was also examined. Unit 
10 was the next unit in CTA that covered similar content to Unit 6, 
containing story scenarios on linear functions. The functions in Unit 
10 were more complex than in Unit 6, as they were more likely to 
include fractions and negative terms. Thus performance in Unit 10 
could measure transfer of concepts learned in Unit 6 to more complex 
linear functions, and time measures in Unit 10 could measure accel- 
erated future learning. All story problems in Unit 10 were nonper- 
sonalized, meaning that students in the experimental group would 
need to transfer the skills they learned from solving personalized 
problems in Unit 6 to more complex normal problems in Unit 10. It 
could be argued that the experimental condition was at a disadvantage 
when solving normal problems in Unit 10 since the control condition 
had received normal problems all along during Unit 6. Students 
typically reached Unit 10 one or two months after Unit 6 (M = 1.01 
months, SD = 0.77 months). Out of the 145 students who were 
randomly assigned to control or experimental conditions within Unit 
6, 122 made it to Unit 10 before the school year ended. Thus, the 
analyses from Unit 10 likely omitted additional low-performing 
students (N = 23). 


Knowledge Components in CTA 


The CTA environment adapts instruction by tracking students’ 
learning of different knowledge components (KCs; Koedinger & 
Aleven, 2007) or mathematical concepts. While the curricular goals of 
CTA come from state standards, the KCs are at a finer grained level. 
They are initially developed by the curriculum designers and refined 
as student data sets are analyzed. Due to the large number of KCs 
assessed in each unit, for the analyses KCs were grouped into three 
categories—easy, medium, and hard. This grouping was based on 
students’ actual performance on different KCs in Units 6 and 10 
(shown in Tables 1–2). The classifications of easy/medium/hard 
corresponded approximately to performance quartiles of 75%–100%, 
50%–75%, and 25%–50% correct, respectively. In Unit 6, two KCs were 
classified as hard (writing symbolic expressions that involved a positive 
slope and intercept or a negative slope and intercept), and in Unit 10, 
expression writing was also classified as hard. 


Materials and Tasks 


Before entering Unit 6, students completed an interests survey 
within CTA where they rated their level of interest (“How much do 
you like [topic]?”) in nine different areas (sports, music, movies, TV, 
games, art, computers, food, and stores) on a 4-point scale (1 = It’s 
boring, 2 = It’s okay, 3 = I like it, 4 = It’s my favorite thing). This 
assessment has been used in our prior work in conjunction with 
student interviews and is intended to measure level of interest in or 
liking of (Renninger & Hidi, 2011) topic areas. Four variations on 
each of the 27 original story problems in Unit 6 were written to 
correspond to different topic interests. The variations had the same 
mathematical structure (i.e., linear function) as the original problem, 
but they had different cover stories relating to different interest 
categories (see Table 3). Only one set of numbers is shown in Table 3 for 
simplicity; however, in this and other actual problems, CTA used 
variations with slightly different number sets to prevent cheating. This 
was controlled for in the regression models. 

Personalized variations were written based on open-ended interest 
surveys conducted in area high schools (N = 50) and additional 
surveys (N = 22) and interviews (N = 29) conducted in prior work. 
Problems were constructed by taking specific objects, events, or ideas 
that were mentioned in relation to a topic during the surveys or 
interviews and writing algebra problems that corresponded to these 
discussions. Two master teachers of algebra reviewed the personalized 
scenarios for their understandability and relevance to high school 
students, and modifications were made accordingly. The personalized 
problems were thus handcrafted for this study. Problem creation can 
be a bottleneck for the development of adaptive interventions. However, 
large banks of personalized problems associated with a full 
mathematics curriculum were recently created for the new ITS 
MATHia (Carnegie Learning, 2011), which personalizes problems 
to student interests such as sports and music and is already 
in use in over 200 schools. This suggests that developers may 
see the potential of such an investment. 


¹ Curriculum progress is considered by the developers of CTA to be the 
best measure of student knowledge and achievement within the Cognitive 
Tutor environment (S. Ritter, personal communication, 2008). Curriculum 
progress is also a good measure of achievement because it is tied to 
students’ Algebra I course grades. 


    You have just been promoted to assistant manager at 
    PAT~E-OH Furniture Inc. and have received a raise to $10.50 
    per hour. 

    1. How much would you be paid if you worked five hours? 
    2. How much would you be paid if you worked 10 and 1/2 hours? 

    If you have not already done so, please fill in the 
    expression row with an algebraic expression for the total 
    pay. Then use the expression and the Solver to answer 
    questions 3 and 4 below. 

    3. How many hours must you work to make five hundred fifty 
       dollars? 
    4. In order to make $2,200.00, how many hours must you work? 

    To write the expression, define a variable for the time worked 
    and use this variable to write a rule for your total pay. 

Figure 1. Screenshot of a question displayed in Unit 6 of the Cognitive Tutor Algebra software. The Answer 
Key table has been superimposed over the screenshot to show the correct answers to each question. 


Procedures 


Before entering Unit 6, all participants were given the interests 
survey. Participants were then randomly assigned to control or treatment 
groups. The control condition received the normal algebra story 
problems already used within Unit 6, while the treatment condition 
received one of four possible interest-related variations of each 
problem, based on responses to the interest survey. The computer 
would choose the variation that had the highest rated level 
of interest on the interests survey. Personalization was only in 
place for Unit 6, and performance and learning measures were 
collected for Units 6 and 10. 


Table 1 
Difficulty of Knowledge Components Assessed Within Unit 6 in 
Terms of Student Accuracy 

Knowledge component                                Percent correct   Knowledge component classification 
No knowledge component                             88.78             Easy 
Identifying independent and dependent units        85.47             Easy 
Entering a given                                   85.31             Easy 
Solving for x, slope only                          75.96             Medium 
Using difficult numbers                            74.96             Medium 
Using small numbers                                74.76             Medium 
Solving for x, negative slope with intercept       69.70             Medium 
Solving for x, positive slope with intercept       68.74             Medium 
Write expression, slope only                       64.79             Medium 
Write expression, negative slope with intercept    50.86             Hard 
Write expression, positive slope with intercept    44.45             Hard 

Note. No knowledge component cells included labeling independent and 
dependent quantities. 


Table 2 
Difficulty of Knowledge Components Assessed Within Unit 10 in 
Terms of Student Accuracy 

Knowledge component                                Percent correct   Knowledge component classification 
No knowledge component                             92.14             Easy 
Identifying units                                  84.72             Easy 
Creating graphs of linear functions                82.33             Easy 
Entering a given                                   79.40             Easy 
Using simple numbers                               [illegible]       Medium 
Using large numbers                                70.08             Medium 
Using small numbers                                69.80             Medium 
Using difficult numbers                            69.49             Medium 
Solving for x, fraction/negative slope             60.73             Medium 
Solving for x, positive integer slope              51.47             Medium 
Write expression, any form                         42.33             Hard 

Note. No knowledge component cells included labeling independent and 
dependent quantities. 
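
The selection rule described under Procedures (the computer chooses the variation whose topic received the highest interest rating) can be sketched as follows. This is a minimal illustration, not the CTA implementation: the function name `choose_variation`, the data layout, and the tie-breaking rule (the first listed topic wins) are assumptions, since the article does not describe how ties were resolved.

```python
from typing import Dict, Sequence


def choose_variation(ratings: Dict[str, int], available: Sequence[str]) -> str:
    """Return the available topic with the highest survey rating.

    ratings   -- survey responses on the 4-point scale (1 = boring ... 4 = favorite)
    available -- topics for which a variation of this problem was written
    Ties are broken by order in `available` (an assumption, not from the article).
    """
    # max() returns the first maximal element, which gives the tie-break above.
    return max(available, key=lambda topic: ratings.get(topic, 0))


# Hypothetical student responses for four of the nine surveyed areas.
example_ratings = {"sports": 2, "food": 4, "stores": 3, "movies": 4}
example_available = ["food", "sports", "stores", "movies"]
chosen = choose_variation(example_ratings, example_available)
print(chosen)
```

In this hypothetical case, food and movies are both rated 4, so the assumed tie-break returns the first of the two in the list.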


Data Analysis 


CTA collects detailed logs of students’ interactions with the 
tutoring environment, including how they answered each problem, 
hints and feedback received, and response times. Student logs for 
Unit 6 and Unit 10 of the software were analyzed with multilevel 
models (Snijders & Bosker, 1999) using the R software package 
with the lmer function (D. Bates & Maechler, 2010). This technique 
was used because it allows for logistic modeling of dichotomous 
outcomes (correct/incorrect response) without the requirement 
that the data be balanced, fully nested, or fully crossed. Here, 
because of the adaptive nature of the CTA program and the 
personalization, not all students would receive the same problems 
or the same number of problems. 

The Level 1 unit of analysis was repeated observations of each 
student solving one part of one problem (i.e., filling in one cell in 
Figure 1) involving one KC. Random effects included students nested 
within teachers, which story problem and number set the student was 
working on, and the item or linear function (i.e., y = −2.5x − 35 in 
Table 3) underlying that story problem.² Fixed effects included which 
condition the student had been assigned to in Unit 6 (experimental or 
control), the difficulty of the KC being tracked in the problem part 
(easy, medium, hard), and the interaction of condition and KC. The 
dependent measure was either whether the student got the problem 
part correct or incorrect on the first attempt (a logistic regression 
model) or the number of seconds the student took to enter an answer 
to the problem part (a linear regression model). 
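
One way to write out the logistic model described above is sketched below. The notation is ours and is not taken from the article: i indexes a problem-part observation, Exp is an indicator for the experimental condition, Med and Easy are indicators for KC difficulty (with control and hard KCs as reference categories, as in Table 4), and t, u, v, and w stand for the random intercepts for teacher, student within teacher, story problem and number set, and item. Treating these random effects as independent, mean-zero normal terms is the usual multilevel assumption rather than something the article states.

```latex
\begin{align*}
\operatorname{logit}\,\Pr(Y_{i} = 1)
  &= \beta_0
   + \beta_1\,\mathrm{Exp}_{i}
   + \beta_2\,\mathrm{Med}_{i}
   + \beta_3\,\mathrm{Easy}_{i} \\
  &\quad
   + \beta_4\,\mathrm{Exp}_{i}\,\mathrm{Med}_{i}
   + \beta_5\,\mathrm{Exp}_{i}\,\mathrm{Easy}_{i} \\
  &\quad
   + t_{\mathrm{teacher}(i)}
   + u_{\mathrm{student}(i)}
   + v_{\mathrm{problem}(i)}
   + w_{\mathrm{item}(i)}
\end{align*}
```

Under this sketch, the response-time model would use the same right-hand side with the number of seconds as the outcome, an identity link, and a residual error term.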

Analyses of learning curves (Mathan & Koedinger, 2005) for 
KCs were also conducted. Learning curves show average levels of 
performance on KCs dependent on the number of opportunities 
students have had to practice, which is the number of times they 
have been given a problem part that involves the KC. An analysis 
of whether students in the experimental condition were able to 
learn KCs with less practice was conducted by adding the number 
of practice opportunities and the interaction of condition with 
practice opportunities to the model. For an overview of how these 
equations are modeled and their assumptions, see Snijders and 
Bosker (1999). 
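
The learning curves described above can be illustrated with a short sketch that averages correctness by opportunity number. It assumes a simplified log format (a chronological sequence of student and correctness pairs for a single KC) and a hypothetical function name, `learning_curve`; the actual CTA logs and the article's multilevel analysis are more detailed.

```python
from collections import defaultdict


def learning_curve(attempts):
    """Average first-attempt correctness at each practice opportunity.

    attempts -- iterable of (student_id, correct) pairs for ONE knowledge
                component, in chronological order (assumed log format).
    Returns {opportunity_number: proportion_correct}.
    """
    seen = defaultdict(int)                # opportunities so far, per student
    totals = defaultdict(lambda: [0, 0])   # opportunity -> [n_correct, n_attempts]
    for student, correct in attempts:
        seen[student] += 1
        opportunity = seen[student]
        totals[opportunity][0] += int(correct)
        totals[opportunity][1] += 1
    return {opp: c / n for opp, (c, n) in sorted(totals.items())}


# Hypothetical log for two students practicing the same KC.
log = [("a", False), ("b", False), ("a", True), ("b", False), ("a", True)]
curve = learning_curve(log)
print(curve)
```

Computing one such curve per condition gives the kind of comparison plotted in Figure 2; the regression with practice opportunities and the Condition × Opportunity interaction then tests whether the curves differ.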


Results 


This section examines how students performed in Units 6 and 10 
in terms of correct responses. We look at performance on different 
KCs and learning curves of how experimental and control students 
mastered concepts over time. We examine how these results varied 
for students struggling with algebra. We then look at response time 
measures, including gaming the system. All of these measures are 
given for Unit 6 (while the intervention was in place) and Unit 10 
(after the intervention was removed). 

The personalization intervention in Unit 6 improved student 
accuracy and decreased response times for hard KCs involving 
writing expressions. In Unit 10, after the intervention had been 
removed for four units, students in the experimental group were 
still more accurate and faster at writing more complex expres- 
sions, suggesting transfer and accelerated future learning. 
Within Unit 6, students in the personalization condition learned 
hard KCs in fewer attempts, and the impact of personalization 
was significantly greater for students identified as struggling 
with algebra. We now describe each of these results in more 
detail in the following sections. 


Performance Measures in Units 6 and 10 


Correct responses. Table 4 shows that students in the personalization 
condition performed significantly better than the control 
group on hard KCs within Unit 6 (odds = 1.52, p < .001). Transforming 
odds into probabilities, the control group’s accuracy when 
writing expressions was 43.20% (0.76/[1 + 0.76]), while the experimental 
group had a 53.66% success rate (0.76 × 1.52/[1 + 0.76 × 
1.52]). Personalization also significantly improved performance on 
easy KCs in Unit 6 (odds = 1.52 × 0.94 = 1.43, p < .001). 
Personalization may support fluidity on less mathematically relevant 
parts of problems, like entering givens and labeling quantities. 
Personalization had a nonsignificant but directionally positive effect on 
performance for medium KCs (odds = 1.52 × 0.77 = 1.17, p = .06). 
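
The odds-to-probability transformation used in this paragraph can be checked with a few lines of code. The sketch below uses the rounded values printed in Table 4 (baseline odds of 0.76 and an odds ratio of 1.52), so it reproduces the reported percentages only approximately; the variable and function names are ours.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


baseline_odds = 0.76     # control group, hard KCs (rounded intercept odds, Table 4)
condition_ratio = 1.52   # odds ratio for the experimental condition (Table 4)

control_p = odds_to_probability(baseline_odds)
experimental_p = odds_to_probability(baseline_odds * condition_ratio)

print(f"control:      {control_p:.4f}")
print(f"experimental: {experimental_p:.4f}")
```

With these rounded inputs the results are about 0.432 and 0.536, close to the reported 43.20% and 53.66%; the small discrepancy is presumably due to the article using unrounded coefficients.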

Unit 10 performance is shown in Table 5. An additional fixed 
effect was added for the number of opportunities each student had 
to solve hard KCs in Unit 6, to control for differential exposure 
between conditions.³ However, results were similar with or without 
this term. As shown in Table 5, four units later, with the 


² The item variable was only necessary for Unit 6, since within Unit 6, 
the same item could have different cover stories depending on whether the 
student was in the experimental or control group and which interests were 
selected. 

³ The median number of hard KC problem parts presented to students in 
the control group in Unit 6 was 22 (M = 20.53), compared to 19.5 (M = 
22.97) for the experimental group. Being given more opportunities to 
master a KC, rather than being advanced to the next section, usually 
evidences weaker performance. 
Table 3 
Example of Story Problem From Unit 6 Received by Students in the Control Group, Followed by Four Interest-Based Variations of 
This Problem That CTA Chose From for Students in the Experimental Group 

Interest                         Problem text 

Normal problem (control group)   An experimental liquid (LOT#XLHS-240) is being tested to determine its behavior under extremely low 
                                 temperatures. Its current temperature is −35 degrees Celsius and is slowly being lowered by two and 
                                 one-half degrees per hour. 
                                 What will the temperature of the liquid be ten hours from now? 
                                 What will the temperature of the liquid be tomorrow at this time? 
                                 When will the temperature drop to one hundred degrees below zero Celsius? 
                                 Assuming that the temperature has been dropping at the same rate, when was the temperature zero 
                                 degrees Celsius? 

Food                             A new soda at McDonald’s is being tested to determine its behavior under extremely low temperatures. 
                                 Its current temperature is −35 degrees Fahrenheit and is slowly being lowered by two and one-half 
                                 degrees per hour. 

Sports                           A new sports drink is being tested to determine its behavior under extremely low temperatures. Its 
                                 current temperature is −35 degrees Fahrenheit and is slowly being lowered by two and one-half 
                                 degrees per hour. 

Stores                           The Dippin’ Dots store at the mall uses extremely low temperatures to freeze its ice cream into tiny 
                                 balls. Right now, the temperature of a batch of chocolate Dippin’ Dots ice cream is −35 degrees 
                                 Fahrenheit and is slowly being lowered by two and one-half degrees per hour. 

Movies                           The Dippin’ Dots stand at the movie theater uses extremely low temperatures to freeze its ice cream 
                                 into tiny balls. Right now, the current temperature of a batch of chocolate Dippin’ Dots ice cream is 
                                 −35 degrees Fahrenheit and is slowly being lowered by two and one-half degrees per hour. 

Note. In this and some other problems, the units were changed as part of the personalization (e.g., American high school students have everyday 
experience considering temperature in degrees Fahrenheit, not degrees Celsius). CTA = Cognitive Tutor Algebra. 


experimental treatment long removed,⁴ students who had received 
personalization in Unit 6 were still significantly better at algebraic 
expression writing in Unit 10 (odds = 1.30, p = .0097). The 
knowledge students gained from receiving personalization in Unit 
6 may have transferred to more difficult tasks involving writing 
expressions with negative or fraction slopes from normal story 
problems. There were no significant differences for easy (p = .48) 
or medium (p = .42) KCs in Unit 10. 

Learning curves. Learning curves show average performance 
levels for a KC dependent on the number of opportunities students 
have had to practice that KC. Models that use learning curves are 
important because they control for number of practice opportuni- 
ties, taking into account the possible effects of differential expo- 
sure to KCs. The learning curves for students solving hard KCs in 
Unit 6 are shown in Figure 2. The lack of a clear upward trend in 
the learning curves, especially for the control group, suggests that 
writing symbolic expressions is a challenging skill. 

Opportunity and Condition × Opportunity interaction terms 
were added to the model for Unit 6 to determine if the learning 
curves in Figure 2 were significantly different. Results showed that 
for hard KCs, there was a significant and positive Condition × 
Opportunity interaction (z = 2.09, p = .036) in Unit 6. This 
suggests that as students were given more opportunities to master 
hard KCs, receiving personalization incrementally improved performance 
above the control condition. Personalization doubled the 
raw gain students saw from each practice opportunity: with each 
opportunity, the odds of a correct response were multiplied by 1.04 for 
the control group and by 1.08 for the experimental group. Thus, results 
suggest that in Unit 6, students in the experimental group were able 
to learn expression writing with less practice. The difference 
between the learning curves for hard KCs in Unit 10 was not 
significant (p = .79). 
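
Because the per-opportunity effects reported above are odds multipliers, they compound over repeated practice. The sketch below illustrates this with the two reported values (1.04 and 1.08); the function name and the choice of 1, 5, and 10 opportunities are ours, and the calculation ignores the other terms in the model.

```python
def cumulative_odds_multiplier(per_opportunity_ratio: float, opportunities: int) -> float:
    """Factor by which the odds of a correct response change after n opportunities."""
    return per_opportunity_ratio ** opportunities


CONTROL_RATIO = 1.04        # reported per-opportunity odds multiplier, control group
EXPERIMENTAL_RATIO = 1.08   # reported per-opportunity odds multiplier, experimental group

for n in (1, 5, 10):
    control = cumulative_odds_multiplier(CONTROL_RATIO, n)
    experimental = cumulative_odds_multiplier(EXPERIMENTAL_RATIO, n)
    print(f"after {n:2d} opportunities: control x{control:.2f}, experimental x{experimental:.2f}")
```

Under this simplification, ten practice opportunities would multiply the odds of a correct response by roughly 1.5 in the control group and roughly 2.2 in the experimental group.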

Performance for struggling students. The impact of person- 
alization was also examined based on student prior achievement. 


Within the CTA environment, as previously mentioned, one of the 
best measures of achievement is curriculum progress. An indicator 
variable was added to the model to identify students who did not 
reach Unit 6 until halfway through the data collection period or 
later (February–May). Students in these classes began working in 
CTA in September and were expected to complete 6–7 units every 
9 weeks as part of their Algebra I course. Students who reached 
Unit 6 in February or later (25 out of the 145 students; 13 control, 
12 experimental) were far behind their peers and struggling to 
meet course expectations. As personalization seemed to have the 
largest and most consistent impact on hard KCs, this analysis was 
limited to the problem parts where students were writing expressions 
with slope and intercept terms.⁵ Fixed effects included which 
condition the student was in, whether the student was identified as 
a struggling student by the curriculum progress measure, and the 
interaction of these terms. 

As can be seen from Table 6, low-achievement students in the 
personalization condition greatly outperformed low-achievement 
students in the control condition on hard KCs in Unit 6 (odds = 
1.64 × 2.56 = 4.20, p = .037). The raw difference in performance 
for these struggling students was a 24.69% success rate on hard 
KCs for the control group versus a 57.90% success rate on hard 
KCs for the experimental group. Considering only students with 
low prior achievement, personalization had a very large, positive 
impact on performance for algebraic expression writing in Unit 6. 
However, the table also shows that there was a high variation for 
this effect, suggesting that personalization did not have a stable 
impact for all students identified in this manner. This may be a 


⁴ The mean number of months it took students to reach Unit 10 from 
Unit 6 was 1.01 months for the control group and 1.00 months for the 
experimental group. This difference did not approach significance (p = 
.90). 

⁵ Analyses using the full data set yielded the same results. 
Table 4 
Output for Hierarchical Logistic Regression Model of Performance Within Unit 6 

Fixed effects                        Raw coefficient   SE     Odds   z value       Significance 
Intercept                            [illegible]       0.23   0.76   [illegible] 
Condition-control                    Ref. 
Condition-experimental               0.42              0.11   1.52   3.78          [illegible] 
KC-hard                              Ref. 
KC-medium                            1.33              0.06   3.78   22.21         [illegible] 
KC-easy                              2.18              0.06   8.84   35.69         [illegible] 
Condition-experimental:KC-medium     −0.26             0.08   0.77   [illegible]   [illegible] 
Condition-experimental:KC-easy       −0.07             0.09   0.94   −0.77 

Note. The second column gives raw coefficients for each predictor, which are in logit form, while the fourth 
column transforms the coefficients into odds, a common measure of effect size for logistic regression. Random 
effects are not included for brevity; however, in general, most of the variance was at the student and item levels. 
Hard KCs are the reference in all tables. KC = knowledge component; Ref. = reference category. 


result of curriculum progress being only a rough proxy for 
achievement. Personalization still had a significant and sizeable 
positive impact on performance for other students (odds = 1.64, 
p = .006), so the effectiveness of the intervention was not being 
driven entirely by the low-achievement students. 
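
The combined odds quoted for struggling students are the product of the condition main effect and the condition-by-struggling-student interaction. The sketch below reproduces that arithmetic from the odds reported in Table 6; the function name `combined_odds` is an illustrative assumption, not part of the authors' analysis code.

```python
import math

def combined_odds(*odds_terms):
    """Multiply odds-scale effects; this equals exponentiating the sum of the logits."""
    return math.prod(odds_terms)

# Odds reported in Table 6 (hypothetical variable names).
condition_odds = 1.64      # experimental vs. control, regular students
interaction_odds = 2.56    # extra multiplicative effect for struggling students

struggling_effect = combined_odds(condition_odds, interaction_odds)
logit_sum = math.log(condition_odds) + math.log(interaction_odds)

print(round(struggling_effect, 2))
print(round(math.exp(logit_sum), 2))
```

Both lines print the same value because multiplying odds and adding logit coefficients are two views of the same model term.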

When considering the data for low-achievement students in Unit 
10, it is important to note that the overall magnitude of the 
differences for hard KCs in Unit 10 was smaller than in Unit 
6: the performance of the control group was 39.67% correct
(0.66/[1 + 0.66]), while the performance of the experimental
group was 46.18% correct ([0.66 × 1.30]/[1 + 0.66 × 1.30]).
However, the impact of having received the treatment might have 
been larger for the group of students identified as having low 
achievement. Unfortunately, since the classification of struggling 
student was based on curriculum progress, only six of the 25 
originally identified struggling students made it to Unit 10 before 
the school year ended. The fact that 19 of the low-achievement 
students for whom personalization was most helpful were omitted 
from this analysis may explain why the magnitude of the differ- 
ence was smaller in Unit 10. 
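
The percent-correct figures above come from converting model odds into probabilities. A minimal sketch of that conversion, assuming the rounded odds shown in Table 5; the helper name `odds_to_probability` is hypothetical.

```python
def odds_to_probability(odds):
    """Convert the odds of a correct answer into a probability."""
    return odds / (1.0 + odds)

# Rounded odds from Table 5; hard KCs are the reference category.
intercept_odds = 0.66   # control group on hard KCs
condition_odds = 1.30   # multiplicative effect of the experimental condition

p_control = odds_to_probability(intercept_odds)
p_experimental = odds_to_probability(intercept_odds * condition_odds)

print(f"control: {p_control:.4f}, experimental: {p_experimental:.4f}")
```

Because the table reports odds rounded to two decimals, the printed values may differ in the last digits from the percentages given in the text.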


Time Measures in Units 6 and 10 


Learning efficiency. An analysis of the time it took students 
to solve problem parts in Units 6 and 10 was conducted to examine 


Table 5 


how personalization impacts learning efficiency. The dependent 
quantity in the model was the number of seconds the student spent 
answering the problem part, measured as the time between when 
the problem came up on the screen and when the student finished 
entering an answer. 

Students in the personalization condition spent significantly less 
time (6.93-s reduction, p = .048) solving problem parts involving 
hard KCs in Unit 6 (see Table 7). Students in the experimental 
group had 0.59 correct answers per minute on hard KCs, while 
students in the control group had 0.42 correct answers per minute 
on hard KCs. There were no significant time differences for easy 
(p = .28) or medium KCs (p = .12). A comparison of reading 
times within Unit 6 was also conducted by using the elapsed time 
between when the problem first appeared and when the student 
entered his or her first response as the dependent variable in the 
regression model. Results showed that students who received 
personalization spent significantly less time reading problems 
(7.04-s reduction, p = .025). 
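
The article does not state exactly how correct answers per minute were computed. One plausible reading, sketched here under that assumption, divides the number of correct responses by the total minutes spent on the problem parts. The function name and the sample log rows are hypothetical.

```python
def correct_per_minute(n_correct, total_seconds):
    """Correct answers per minute of time spent on the problem parts."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return n_correct / (total_seconds / 60.0)

# Hypothetical log rows: (answered correctly?, seconds spent on the problem part).
steps = [(True, 40.0), (False, 75.0), (True, 55.0), (False, 70.0)]

n_correct = sum(1 for correct, _ in steps if correct)
total_seconds = sum(seconds for _, seconds in steps)
rate = correct_per_minute(n_correct, total_seconds)
print(rate)
```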

Participants’ time measures in Unit 10 were examined to see if 
receiving personalization in Unit 6 accelerated future learning in 
Unit 10. Students who had received personalization in Unit 6 were 
still faster at writing algebraic expressions in Unit 10 (6.26-s 
reduction, p = .004; see Table 8). In Unit 10, the control group 
achieved 0.35 correct responses per minute on hard KCs, while the 


Table 5
Output for Hierarchical Logistic Regression Model of Performance Within Unit
10: Experimental Condition Received Personalization Treatment in Unit 6

Fixed effects                      Raw coefficient   SE      Odds    z value       Significance
(Intercept)                        −0.42             0.18    0.66    [illegible]   [illegible]
Opportunities in Unit 6            −0.022            0.004   0.98    −5.54         [illegible]
Condition-control                  Ref.
Condition-experimental             0.27              0.10    1.30    2.59          [illegible]
KC-hard                            Ref.
KC-medium                          1.03              0.07    2.80    14.86         [illegible]
KC-easy                            2.41              0.07    11.13   37.03         [illegible]
Condition-experimental:KC-medium   [illegible]       0.10    0.86    −1.53
Condition-experimental:KC-easy     [illegible]       0.09    0.86    [illegible]

Note. Number of opportunities was median centered. KC = knowledge component; Ref. = reference category.
[Significance-level legend illegible.]

Figure 2. Learning curves for experimental and control groups on hard
knowledge components (KCs) in Unit 6.


experimental group achieved 0.44 correct responses per minute on 
hard KCs. There were no significant time differences between 
experimental and control groups for easy KCs (p = .92) or 
medium KCs (p = .23) in Unit 10. 

Gaming the system. In ITSs, students may engage in
gaming-the-system behaviors where they take advantage of hints
and feedback. Baker and de Carvalho (2008) developed a gaming
detector that uses measures such as the elapsed time between
responses and hint requests to estimate the gaming tendency of each
student. The gaming detector was run on the log data from Unit 6, and
the gaming estimates for each student in the control group were
compared to those in the experimental group using a Student’s t test.
Results showed that students in the experimental group engaged in
gaming behaviors significantly less often than students in the control
group (t = −2.33, p = .037, Cohen’s d = 0.35). There were no
significant gaming differences in Unit 10 (p = .737).
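
The group comparison above relies on two standard quantities: Student's t statistic for independent samples and Cohen's d. The sketch below shows how each is computed from per-student gaming estimates. The data values are invented for illustration, and the equal-variance (pooled) form is an assumption, since the article does not say which variant was used.

```python
import math
from statistics import mean, variance

def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    return math.sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))

def student_t(a, b):
    """Student's t statistic for two independent samples (equal variances assumed)."""
    return (mean(a) - mean(b)) / (pooled_sd(a, b) * math.sqrt(1 / len(a) + 1 / len(b)))

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    return (mean(a) - mean(b)) / pooled_sd(a, b)

# Hypothetical per-student gaming estimates (proportion of gamed actions).
experimental = [0.02, 0.05, 0.03, 0.04, 0.01]
control = [0.06, 0.08, 0.04, 0.07, 0.05]

t_stat = student_t(experimental, control)
d = cohens_d(experimental, control)
print(f"t = {t_stat:.2f}, d = {d:.2f}")
```

A negative t here means the first group (experimental) has the lower mean gaming estimate; effect sizes are often reported as absolute values.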


Discussion 
Here, we discuss the results as they correspond to each of the 
three research questions. 


Research Questions 


R.1: How does context personalization impact performance 
measures while the intervention is in place? In particular, how 
is this effect moderated by (a) the ability of the learner and (b) 
the difficulty of the task? Results showed that context person- 
alization had a significant positive effect on students’ performance 


Table 6 


on easy and hard KCs in Unit 6. Personalization allowed students 
to more successfully write algebraic expressions from story prob- 
lems when the expression included both a slope and an intercept 
term. This improved accuracy was reflected in students’ learning 
curves—students who received personalization were able to mas- 
ter expression writing after fewer practice opportunities. This 
suggests that personalization may allow students to more easily 
compose the different terms in an algebraic expression (e.g., Hef- 
fernan & Koedinger, 1997; Koedinger & McLaughlin, 2010) and 
make meaning of a grounded, concrete story scenario’s relation to 
an algebraic equation. Personalization was most beneficial for 
students who were struggling to progress within the CTA curric- 
ulum. Personalization may act as a support for students struggling 
to learn formal representational systems who are most in need of 
perceptual grounding. This is consistent with Mayer’s (2001) 
individual differences principle, which states that design effects 
are stronger for low-knowledge learners because high-knowledge 
learners are better able to use prior knowledge to compensate for 
fewer supports within the environment. 

We hypothesize that personalization may help to ground ab- 
stract symbols in concrete experience, allowing them to gain 
situation-based meaning. Writing an algebraic expression from a 
story is a complicated skill that can require both the construction 
of an accurate and meaningful situation model and the successful 
coordination of situation and problem models (Nathan et al., 
1992). These actions may require a deep level of processing of the 
story situation and its quantities and relationships. The finding that 
personalization improves such deep-level processing measures 
supports Renninger et al.’s (2002) conclusion that interest-based 
interventions in mathematics allow learners to build important 
connections between the context and the mathematical content. 
This suggests that the personalization enhancements were not 
seductive details that distracted learners (Clark & Mayer, 2003) or 
interfered with transfer to nonpersonalized problems or abstract 
representations (Sloutsky, Kaminski, & Heckler, 2005). Rather, 
personalization may have allowed participants to better learn and 
understand the underlying skill of writing expressions from story 
scenarios, which involves successful coordination of situation- 
based reasoning with abstract models. 

R.2: How does context personalization impact time mea- 
sures while the intervention is in place? We found that context 
personalization allowed students to spend less time reading and 
solving story problems. The reduced durations seemed targeted 
toward one task in particular—writing symbolic equations. Per- 
sonalization also decreased the incidence of gaming-the-system 
behaviors, or instances where students would enter answers or 


Table 6
Output for Hierarchical Logistic Regression Model of Performance on Hard KCs Within Unit 6

Fixed effects                               Raw coefficient   SE     Odds   z value   Significance
(Intercept)                                 −0.33             0.40   0.72   −0.83
Condition-control                           Ref.
Condition-experimental                      0.50              0.18   1.64   2.78      [illegible]
Regular student                             Ref.
Struggling student                          −0.78             0.30   0.48   −2.64     [illegible]
Condition-experimental:struggling student   0.94              0.45   2.56   2.09      [illegible]

Note. KC = knowledge component; Ref. = reference category.
[Significance-level legend illegible.]
Table 7
Output for Hierarchical Linear Regression Model of Step Duration Within Unit 6

Fixed effects                      Estimate   SE     t value       Significance
(Intercept)                        57.90      7.00   8.27          [illegible]
Condition-control                  Ref.
Condition-experimental             −6.93      3.42   [illegible]   [illegible]
KC-hard                            Ref.
KC-medium                          −18.81     1.90   [illegible]   [illegible]
KC-easy                            −42.37     1.89   −22.46        [illegible]
Condition-experimental:KC-medium   3.16       2.63   1.20
Condition-experimental:KC-easy     4.35       2.61   1.66

Note. KC = knowledge component; Ref. = reference category.
request hints rapidly. We hypothesize that along with providing
grounding for key ideas in algebra, personalization has an impor- 
tant impact on attention and engagement. Interest-based interven- 
tions may lower reaction times by allowing for automatic alloca- 
tion of attention (McDaniel et al., 2000). Writing expressions 
involves the learner working directly with the context of the story 
problem and engaging in the difficult coordination between a 
situation model and a problem model. Personalization may focus 
attention when students engage in deep-level processing to accom- 
plish this coordination. The reduction in gaming behaviors also 
suggests that personalization may orient learner attention toward 
extracting meaning, rather than rapidly requesting hints or feed- 
back. 

R.3: How does context personalization impact measures of 
robust learning once the intervention is removed? Four units 
after the intervention, students who had been in the experimental 
group still had significantly greater accuracy when writing alge- 
braic expressions from normal story scenarios and wrote expres- 
sions in significantly less time compared to the control group. The 
results suggest that personalization promoted robust learning of the 
underlying concept of algebraic expression writing and was asso- 
ciated with accelerated future learning and transfer. The robust 
learning gains for Unit 10 challenge the notion that personalization 
is a crutch whose effects will not persist or transfer to the solving 
of nonpersonalized problems. Instead, the results suggest that 
personalization acts as a scaffold, providing grounding for students 
as they learn important skills relating to coordinating situation and 
problem models when writing algebraic expressions. Students may 


Table 8 


be able to later flexibly apply these skills to normal story problems 
with more complex underlying algebraic expressions. 


Significance 


This study contributes to research on personalization and inter- 
est in several ways. First, we have illustrated the effectiveness of 
an interest-based intervention in a K-12 school during the course 
of regular instruction over an extended period. This allowed for a 
thorough and controlled examination of how performance and 
learning unfold in ecologically valid compulsory school settings 
where high stakes are attached to learning outcomes, especially in 
mathematics. This has been contrasted with research on pull-out 
studies with children (Renninger et al., 2002; Stacey & 
MacGregor, 1999; Walkington et al., 2012), children taking writ- 
ten exams (E. Bates & Wiest, 2004; Koedinger & Nathan, 2004), 
and studies with adult learners or laboratory settings (Durik & 
Harackiewicz, 2007; McDaniel et al., 2000; Reber et al., 2009). 

Although previous research on the impact of personalization in 
mathematics has had mixed results (e.g., Cakir & Simsek, 2010; 
Chen & Liu, 2007), here we have shown that an interest-based 
intervention can be effective for mathematics learning. We hy- 
pothesized that the mixed results in the literature may reflect a 
need to closely match problem difficulty and student ability. We 
thus took a novel approach, implementing personalization in the 
context of an ITS that adapts instruction to current knowledge 
states. This study demonstrates the powerful synergistic effect of 


Table 8
Output for Hierarchical Linear Regression Model of Step Duration Within Unit
10: Experimental Condition Received Personalization Treatment in Unit 6

Fixed effects                      Estimate      SE     t value       Significance
(Intercept)                        67.49         2.76   24.45         [illegible]
Condition-control                  Ref.
Condition-experimental             −6.26         2.16   −2.90         [illegible]
KC-hard                            Ref.
KC-medium                          [illegible]   1.38   −24.79        [illegible]
KC-easy                            [illegible]   1.28   [illegible]   [illegible]
Condition-experimental:KC-medium   8.08          1.91   4.24          [illegible]
using technology systems to adapt instruction to both knowledge
states and personal interests. 

Furthermore, as previous research on interest has largely fo- 
cused on domains like reading or arithmetic, we contribute to an 
understanding of the ways in which interest can mediate learning 
in an advanced domain that emphasizes abstract representations, 
relational thinking, and generalization. Research on the develop- 
ment of algebraic reasoning has accentuated the importance of 
situating algebra learning in rich, engaging mathematical contexts 
(Bardini et al., 2004; Carraher et al., 2006); however, students 
often find secondary courses meaningless with respect to their 
everyday experiences (Mitchell, 1993). Here, we have shown that 
even minor modifications to make problems more interest based 
can have a positive impact on learning. We hypothesized that 
personalization may allow students to learn how to construct 
meaningful situation models of stories and coordinate them with 
problem models by providing perceptual grounding that makes 
abstractions more meaningful. Coordinating situation-based rea- 
soning with formal mathematical modeling is a critical skill not 
only for algebraic expression writing but for many important 
applications of mathematics (e.g., Common Core State Standards 
Initiative, 2010). 


Future Directions 


Discussion of mediators in the relationship between personal- 
ization and improved learning in the present study is speculative, 
as these constructs were not directly measured. In future work, we 
plan to explore the mediators through which personalization is 
associated with learning more directly. We will track at a fine- 
grained level how personalization interacts with measures of mo- 
tivational and metacognitive variables (e.g., situational interest, 
self-efficacy, etc.) using questionnaires (see Bernacki, Nokes- 
Malach, & Aleven, in press). Recent advances in technology-based 
affective state detection also offer an exciting direction for mea- 
suring affect without relying solely on self-report data. We plan to 
incorporate into our intervention affective state detectors (Baker et 
al., 2012) that conduct computational analyses of response behav- 
iors to detect states like engagement, boredom, or frustration. In 
this way, we plan to further delineate the path between a person- 
alization intervention and improved robust learning. 


Conclusion 


The future of adaptive learning technologies in classrooms is 
both promising and generative—the National Academy of Engi- 
neering recently named the development of personalized systems 
for learning as one of the grand challenges for engineering in the 
21st century (Ellis, 2009). Here, we have shown the potential of a 
simple, interest-based adaptation in promoting student learning. 
However, as technology advances, more powerful methods will 
emerge to customize learning to the events, objects, and activities 
that are personally relevant, evocative, and motivating to students 
in K-12 schools today. Personalization systems should be designed 
to leverage students’ interests in authentic and meaningful ways, 
creating systems that allow learners to collaborate around complex 
open-ended tasks that are situated within and adapted to their 
experiences. Such tasks can provide an authentic venue for abstract 
systems of representation to arise as powerful tools for modeling
the world and making sense of personally relevant phenomena.
Here, we have shown only a hint of what the potential of such a
system could be. However, with this work, we contribute to the 
foundation for what we believe the designers of adaptive learning 
technologies should look toward. Students today are accustomed to 
customization, interaction, and control when seeking knowledge— 
modern learning environments and those who design them must 
themselves adapt to the rapid technological changes taking place in 
our world. 

Recently, there has been growing emphasis on supporting robust learning within intelligent tutoring
systems, assessed by measures such as transfer to related skills, preparation for future learning, and 
longer term retention. It has been shown that different pedagogical strategies promote robust learning to 
different degrees. However, the student modeling methods embedded within intelligent tutoring systems 
remain focused on assessing basic skill learning rather than robust learning. Recent work has proposed 
models, developed using educational data mining, that infer whether students are acquiring learning that 
transfers to related skills, and prepares the student for future learning (PFL). In this earlier work, evidence 
was presented that these models achieve superior prediction of robust learning to what can be achieved 
by traditional methods for student modeling. However, using these models to drive intervention by 
educational software depends on evidence that these models remain effective within new populations. To 
this end, we analyze the degree to which these detectors remain accurate for an entirely new population 
of high school students. We find limited evidence of degradation for transfer. More degradation is seen 
for PFL. This degradation appears to occur in part because it is generally more difficult to infer this 
construct within the new population. 

Increasingly, it is thought desirable that students acquire what is
termed robust knowledge (Koedinger, Corbett, & Perfetti, 2012): 
knowledge grounded in conceptual domain knowledge (Craig, 
VanLehn, & Chi, 2008), which transfers more readily to related 
problem situations (Fong & Nisbett, 1991; Singley & Anderson, 
1989), is retained by students over time (Bahrick, Bahrick, Bah- 
rick, & Bahrick, 1993; Schmidt & Bjork, 1992), and prepares 
students for more efficient or more effective future learning
(Bransford & Schwartz, 1999; Schwartz & Martin, 2004). One of 
the well-documented risks in problem solving across STEM (sci- 
ence, technology, engineering, and mathematics) domains is that 
students can develop superficial knowledge that fails these tests of 
robust learning. In particular, when students are not well prepared 
for problem solving, they can develop problem-solving knowledge 
that focuses on surface elements in problem situations, formal 
representations, and features of the learning environment itself 
(Chi, Feltovich, & Glaser, 1981; Rittle-Johnson & Siegler, 1998). 

In line with this shift in perspective, over the past 15 years there 
has been a growing effort by intelligent tutoring system (ITS) 
developers and developers of other intelligent learning environ- 
ments (ILEs) to develop interventions explicitly designed to in- 
crease the robustness of student learning. One general theme has 
been to improve the effectiveness of tutor feedback in supporting 
deep understanding, for example, through natural language tutorial 
dialogues (Graesser et al., 2004; Katz, Connelly, & Wilson, 2007), 
through enhanced student interactivity with graphical feedback 
(Butcher, 2010; Corbett & Trask, 2000), or through focusing 
feedback on domain-independent strategies (Chi & VanLehn, 
2007). A second major approach has focused on incorporating 
student explanations into ITSs, asking students to explain their 
actions in problem solving (Aleven & Koedinger, 2002), or to 
explain worked examples of problem solutions (Corbett et al.,
2011; Hausmann & VanLehn, 2007; McLaren, Lim, & Koedinger,
2008; Schwonke et al., 2009), toward supporting students in monitoring
their understanding. Other efforts have focused on training
metacognitive skills, such as the skill of using a tutoring system’s 
corrective and explanatory feedback effectively (Aleven, 
McLaren, Roll, & Koedinger, 2006; Roll, Aleven, McLaren, & 
Koedinger, 2007), and providing meta-cognitive feedback on stu- 
dents’ skill at self-regulated learning (Chin et al., 2010; Tan & 
Biswas, 2006). 

The advent of interventions that can support the development of 
robust learning raises the question of whether another major ben- 
efit of intelligent tutors and artificial intelligence in education 
(AIED) technologies can be leveraged: individualization. Individ- 
ualization is a major goal of ITS and AIED systems (cf. McCalla, 
1992; VanLehn, 2006), driven by models of students’ latent 
knowledge (cf. Corbett & Anderson, 1995; Martin & VanLehn, 
1995; Shute, 1995). Individualization based on student knowledge 
has had substantial benefits for learners. For instance, Corbett 
(2001) demonstrated that Bayesian student modeling can be used 
to more efficiently distribute problem-solving practice in an ITS, 
leading to a large gain in mean posttest accuracy with only a small 
additional cost in total time on task, compared with a fixed 
curriculum. Bayesian student modeling has also been successfully 
used to monitor student explanations of worked examples in ITSs 
(Conati, Gertner, & VanLehn, 2002; Salden, Koedinger, Renkl, 
Aleven, & McLaren, 2010). 

Efforts to individualize learning environments rely on accurate 
student modeling. The efforts listed above have leveraged models 
of student knowledge that can successfully infer the probability 
that a student knows a specific skill from the student’s history of 
correct responses and noncorrect responses (e.g., errors and hint 
requests) for that skill up until that time (cf. Corbett & Anderson, 
1995; Martin & VanLehn, 1995; Pavlik, Cen, & Koedinger, 2009; 
Shute, 1995). In recent years, the debate about how to best model 
student knowledge has continued, with an increasing number of 
explicit comparisons of models’ ability to predict future perfor- 
mance within the tutoring software studied (cf. Gong, Beck, & 
Heffernan, 2010; Pardos, Gowda, Baker, & Heffernan, 2011;
Pavlik et al., 2009; Wang & Heffernan, 2011). 

Although these student modeling approaches have been success- 
ful at predicting immediate problem-solving performance and im- 
proving performance on those tests, less attention has been paid to 
modeling the robustness of student learning. Several studies have 
shown that Bayesian student modeling can accurately predict 
immediate posttest performance on the same problem-solving 
skills studied with a tutor (e.g., Baker et al., 2010; Corbett & 
Anderson, 1995; Corbett, Maclaren, Kauffman, Wagner, & Jones, 
2010; Pardos et al., 2011; Shute, 1995), a very limited form of 
transfer. But student models in ITSs have typically not attempted 
to go beyond this point in modeling whether learning is robust. 
Relatedly, some results suggest that Bayesian student modeling 
can be insensitive to differences in students’ depth of understand- 
ing. For example, Corbett and Anderson (1995) reported that 
whereas Bayesian student modeling achieved high correlation to 
student posttest performance in the APT Lisp Tutor, it overestimated
average student posttest performance by 5%–10%. Tellingly,
Corbett and Bhatnagar (1997) found that the extent to which
the student model overestimates student test performance is inversely
correlated with each student’s initial declarative knowledge.
In another APT Lisp Tutor study (Corbett & Trask, 2000),
two groups of students worked to cognitive mastery levels with
conventional and enhanced feedback related to a difficult topic.
Although students in the two groups worked to the same nominal 
cognitive mastery criterion, students in the enhanced feedback 
condition scored reliably better on the posttest, again suggesting 
that this type of student modeling may be partially insensitive to 
differences in deep understanding. 

Some steps in the direction of modeling the robustness of 
learning in ITSs have been taken. For example, Jastrzembski, 
Gluck, and Gunzelmann (2006) predict not just posttest perfor- 
mance, but also how long knowledge will be retained after learn- 
ing, within an ITS teaching flight skills. Another step in this 
direction is to assess the transfer of skill within the learning 
system. Much of this work has taken the form of modeling inter- 
connections between skills during learning (cf. Martin & Van- 
Lehn, 1995) or online testing (Desmarais, Meshkinfam, & 
Gagnon, 2006), or in using interconnections between skills to 
revise skill models (Pavlik, Cen, Wu, & Koedinger, 2008). Additionally,
computational modeling has analyzed the mechanisms leading
to accelerated future learning within a learning system (Li,
Cohen, & Koedinger, 2010).

Building on this work, recent work has used data mining to 
develop models that can automatically detect whether student 
knowledge will transfer to related skills outside of the tutoring 
system, and whether students are prepared for future learning 
outside of the tutoring system. The difference between transfer and 
PFL is whether students have the ability to directly apply their 
existing knowledge in novel situations or in new fashions (trans- 
fer), versus whether students can acquire new knowledge more 
quickly or effectively from future instruction, using their existing 
knowledge (preparation for future learning [PFL]). If models are 
developed that accomplish these goals—predicting from in-tutor 
behavior whether a student will be able to successfully transfer her 
or his knowledge out of the tutor to different skills and situations, 
and whether a student will be prepared for future learning outside 
of the tutor—then these models could be used to identify students 
who may be developing superficial knowledge in problem solving 
and in selecting interventions designed to improve the robustness 
of student learning. Students who are already on the road to robust 
learning could continue with existing activities, whereas students 
unlikely to achieve robust learning could receive interventions. 

In this earlier work, robust learning detectors (for both transfer 
and PFL) were developed for a population of undergraduate stu- 
dents using a Cognitive Tutor in the domain of Genetics problem 
solving (Corbett et al., 2010). These detectors were generated by 
engineering complex features related to students’ motivation and 
metacognition and creating a model to predict transfer/PFL from 
these features. They were assessed using cross-validation at the 
student level (e.g., the detectors were repeatedly developed using 
data from one group of students and tested on other students). The 
detectors were found to be better than traditional student modeling 
methods for predicting both transfer and PFL. In this article, we 
study how well these detectors of transfer and PFL generalize at 
the population level, studying the degree to which they transfer to 
a new group of students, specifically, a younger group of high 
school students using the same tutor software. 
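
Cross-validation at the student level, as described above, means that all records from a given student fall on the same side of each train/test split. The sketch below illustrates one way to build such folds; the field names `student` and `feature` and the fold-assignment scheme are assumptions for illustration, not the authors' procedure.

```python
from collections import defaultdict

def student_level_folds(records, n_folds):
    """Yield (train, test) record lists so that no student appears in both."""
    by_student = defaultdict(list)
    for record in records:
        by_student[record["student"]].append(record)
    students = sorted(by_student)
    for fold in range(n_folds):
        # Every n_folds-th student, starting at this fold's offset, is held out.
        test_ids = set(students[fold::n_folds])
        train = [r for s in students if s not in test_ids for r in by_student[s]]
        test = [r for s in students if s in test_ids for r in by_student[s]]
        yield train, test

# Hypothetical tutor log rows: 6 students with 3 rows each.
records = [{"student": f"s{i % 6}", "feature": i} for i in range(18)]
folds = list(student_level_folds(records, n_folds=3))

for train, test in folds:
    print(len(train), len(test))
```

Testing on a new population, as in this article, is a stricter check than this kind of split, because the held-out students come from a different group altogether.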

In addition to examining the models’ degree of generalization, 
we also analyze the specific student behaviors that are associated 
with robust learning in each population, toward increasing understanding
of the conditions under which robust learning occurs in
interactive learning systems of this type. 


Learning System 


Cognitive Tutors are a type of interactive learning environment 
in which cognitive modeling and artificial intelligence are used to 
model student learning, in turn using the model of student learning 
to adapt to individual differences in student knowledge and learn- 
ing (Koedinger & Corbett, 2006). Cognitive Tutor curricula com- 
bine conceptual instruction delivered by a teacher with computer- 
based learning where each student works one-on-one with a 
Cognitive Tutoring system that chooses exercises and feedback on 
the basis of a running model of which skills the student possesses 
(Corbett & Anderson, 1995). 

Within a Cognitive Tutor, as the student works through a set of 
problems, Bayesian knowledge tracing (Corbett & Anderson, 
1995) is used to determine how well the student is learning 
component skills, calculating the probability that the student 
knows each skill based on that student’s history of responses 
within the tutor. Using these estimates of student knowledge, the 
tutoring system gives each student problems that are relevant to the 
skills that he or she needs to learn, continuing to provide problems 
until the student reaches mastery (e.g., 95% probability of knowing 
each skill) on all skills relevant to a given curricular area. 
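
The knowledge-tracing loop described above can be sketched with the standard Bayesian knowledge tracing update for a single skill. The parameter values below are hypothetical and chosen only for illustration; the 0.95 threshold follows the mastery example in the text, and the function names are ours.

```python
def bkt_update(p_known, correct, p_slip, p_guess, p_learn):
    """One Bayesian knowledge tracing step for a single skill."""
    if correct:
        evidence = p_known * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_known) * p_guess)
    else:
        evidence = p_known * p_slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - p_guess))
    # Allow for learning at this practice opportunity.
    return posterior + (1 - posterior) * p_learn

def opportunities_to_mastery(responses, p_init, p_slip, p_guess, p_learn, threshold=0.95):
    """Return how many responses were needed to reach mastery, or None if never reached."""
    p_known = p_init
    for count, correct in enumerate(responses, start=1):
        p_known = bkt_update(p_known, correct, p_slip, p_guess, p_learn)
        if p_known >= threshold:
            return count
    return None

# Hypothetical parameter values for one skill.
params = dict(p_init=0.3, p_slip=0.1, p_guess=0.2, p_learn=0.15)
print(opportunities_to_mastery([True, False, True, True, True], **params))
```

In a tutor, the loop would keep selecting problems that exercise skills still below the threshold rather than replaying a fixed response list.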

Within this article, we study robust learning in the context of the 
Genetics Cognitive Tutor (Corbett et al., 2010). This tutor consists 
of 19 modules that support problem solving across a wide range of 
topics in genetics (Mendelian transmission, pedigree analysis, 
gene mapping, gene regulation, and population genetics). Various 
subsets of the 19 modules have been piloted at 15 universities in 
North America. This study focuses on a tutor module that uses a 
gene mapping technique called a three-factor cross (3FC). The
tutor interface for this reasoning task is displayed in Figure 1. The 
3FC technique is used to determine both the order of three genes 
(F, G, and H in this example), which lie on one chromosome, and 
to find the relative distances between the pairs of genes. In this 
technique, two organisms are crossed (two fruit flies in the exam- 
ple), and the resulting distribution of offspring phenotypes is 
analyzed to infer the arrangement of the three genes on the chro- 
mosome. In Figure 1, the student has almost finished the problem. 
The student has summed the number of offspring in each of four 
phenotype groups that appear in the offspring table, and has 
categorized each group (as “parental,” “single crossover” during
meiosis, or “double crossover” during meiosis). The student has
compared the phenotype patterns in the offspring groups to iden- 
tify the middle of the three genes and entered a gene sequence 
below the table. Finally, in the lower right of Figure 1, the student 
has calculated the crossover frequency between two of the genes, 
G and H, and the distance between the two genes. The student will 
perform the last two steps for the other two gene pairs. 
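
The reasoning steps in a 3FC problem can be sketched in code. The following is a textbook-style calculation, not the tutor's implementation; the function name and the offspring counts are hypothetical. Phenotypes are written as three letters, with upper case for the dominant allele and lower case for the recessive allele.

```python
def analyze_three_factor_cross(counts):
    """Infer the middle gene and pairwise map distances from offspring counts."""
    total = sum(counts.values())

    # Pool reciprocal phenotypes (e.g., "FGH" and "fgh") into four groups.
    groups = {}
    for phenotype, n in counts.items():
        key = min(phenotype, phenotype.swapcase())
        groups[key] = groups.get(key, 0) + n

    # The most frequent group is parental; the rarest is the double crossover.
    ranked = sorted(groups, key=groups.get)
    double_crossover, parental = ranked[0], ranked[-1]

    # A double crossover swaps only the middle gene relative to the other two,
    # so the middle gene is the odd one out when the two groups are compared.
    differs = [a != b for a, b in zip(parental, double_crossover)]
    middle_index = differs.index(sum(differs) == 1)
    middle_gene = parental[middle_index].upper()

    def map_distance(i, j):
        # Offspring are recombinant for genes i and j when exactly one of the
        # two alleles departs from the parental arrangement.
        recombinant = sum(n for p, n in counts.items()
                          if (p[i] == parental[i]) != (p[j] == parental[j]))
        return 100.0 * recombinant / total  # map units

    genes = parental.upper()
    distances = {genes[i] + genes[j]: map_distance(i, j)
                 for i, j in [(0, 1), (0, 2), (1, 2)]}
    return middle_gene, distances


# Hypothetical offspring table for genes F, G, and H (1,000 offspring).
offspring = {"FGH": 420, "fgh": 410, "FgH": 60, "fGh": 62,
             "FGh": 20, "fgH": 22, "fGH": 3, "Fgh": 3}
middle, distances = analyze_three_factor_cross(offspring)
print(middle, distances)
```

In this made-up data set the middle gene is F, so the gene order is G, F, H. The distance between the two outer genes is smaller than the sum of the two adjacent distances because double crossovers are not counted as recombinants for the outer pair.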


Robust Learning Measures 


The robustness of student learning was measured through two 
tests: a transfer test and a PFL test. A standard pretest and posttest, 
measuring the exact skills studied in the tutor, were also given. 

The transfer test consisted of two problems. The first problem
was a 3FC task in which double crossovers were so improbable
that the double-crossover offspring group was missing. This is a
“gap filling” transfer task (cf. VanLehn, Jones, & Chi, 1992). The
problem is solvable and most of the students’ problem-solving
knowledge directly applies; the task examines whether students
can draw on their understanding of that problem-solving knowledge
to fill in the “gap” that results from the missing offspring
group. The second problem examines whether students can extend
their understanding of crossovers and crossover notation from
three genes to four genes. In this problem, students were given a
parental genotype with four genes and asked to identify how many
crossovers had occurred in various offspring groups (based on
phenotype structure rather than relative frequency) and to identify
all the offspring groups in which a specific crossover had occurred.
Students completed this transfer test following the problem-solving
posttest at the end of Session 2.

Figure 1. The three-factor cross lesson of the Genetics Cognitive Tutor.
The problem shown reads: “In a student lab, a test cross was performed
between a fruit fly that was heterozygous for three genes and one that
was homozygous recessive. The offspring were scored for the three
phenotypes. The student's data is shown below. Determine the gene order
and the map distances for the three genes.”

It is worth noting that the form of transfer represented by these 
problems can be seen as different from simply transferring knowl- 
edge to an isomorphic problem (cf. Gick & Holyoak, 1987). 
However, transfer problems of the more complex nature seen here, 
requiring some reasoning beyond simply transfer of skill, are 
frequently also seen in research on robust learning in interactive 
learning software (cf. Aleven & Koedinger, 2002; Atkinson, 2002; 
Hausmann & VanLehn, 2007; Mathan & Koedinger, 2005), and 
may represent a deeper test of the robustness of knowledge than an 
isomorphic problem. Interestingly, this more complex type of 
transfer problem is sometimes termed far transfer, but it is not yet 
clear whether it is more difficult for students to modify their 
knowledge to accomplish a related task (the type of transfer seen 
here) or whether it is more difficult for them to realize that their 
existing knowledge applies in a different context (the type of 
transfer studied in Gick & Holyoak, 1987). 

In the PFL test, students were asked to solve parts of a four- 
factor cross problem. The reasoning is related to solving a 3FC 
problem, but sufficiently more complicated that a student could not 
be expected to invent a solution method by direct transfer, and 
certainly not in a short period of time. Consequently, this PFL test 
presented a 2.5-page description of the reasoning in a four-factor 
cross experiment, then asked students to solve some elements of a
four-factor cross problem: identifying the middle genes, identifying
all the offspring groups with a crossover between two specific
genes, and finding the map distance between those two genes.


Previous Models 


In Baker, Gowda, and Corbett (2011a, 2011b), we presented
models that can predict student transfer and PFL. These models
were developed using data from 72 college students enrolled in
biology courses at Carnegie Mellon University, who each used the
Genetics Cognitive Tutor software for 2 hr, completing a total of
22,885 problem-solving attempts across a total of 10,966 problem
steps in the tutor.


Feature Engineering 


The first step of our process of developing models of robust 
learning was to engineer a set of features on the basis of a 
combination of theory and prior work detecting related behaviors. 
We tested a set of 18 features, represented as a set of nine core
features and nine related features. Features 1–5 and their related
features focus on student interactions with the tutor’s hints and
feedback. Features 6–8 and their related features focus on the
student’s problem-solving actions. The ninth feature involves the
dynamics of the student’s learning, moment by moment.


1. Help avoidance (Aleven et al., 2006), not requesting help on 
poorly known skills (on the student’s first attempt at a specific 
problem step), and a related feature, Feature 1’, not requesting help 
on well-known skills. 

2. Long pauses after receiving bug messages (error messages 
given when the student’s behavior indicates a known misconcep- 
tion), which may indicate self-explanation (cf. Chi, Bassok, Lewis, 
Reimann, & Glaser, 1989) of the bug message, and its inverse, 
Feature 2’, short pauses after receiving bug messages (indicating a 
failure to self-explain). 

3. Long pauses after reading on-demand help messages (poten- 
tially indicating deeper knowledge or self-explanation), and an 
inverse feature, Feature 3’, short pauses after reading the on- 
demand help message. 

4. Long pauses after reading an on-demand help message and 
getting the current action right (cf. Shih, Koedinger, & Scheines, 
2008), and an inverse feature, Feature 4’, short pauses after reading 
an on-demand hint message and getting the current action right. 
Features 4 and 4’ are subsets of Features 3 and 3’. 

5. Long pauses on skills that the student probably knows (may 
indicate continuing to self-explain even after proceduralization), 
and an inverse feature, Feature 5’, short pauses on skills assessed 
as known. 

6. Off-task behavior (Baker, 2007), where the student is engaged 
in behavior that does not involve the system or a learning task, and 
a related feature, Feature 6’, long pauses that are not off-task (may 
indicate self-explanation, or asking teacher for help; cf. Schofield, 
1995). Off-task behavior is assessed using an automated detector 
(Baker, 2007). 

7. Gaming the system (Baker, Corbett, Roll, & Koedinger, 
2008), attempting to succeed at problem steps without learning the 
material (by clicking through help messages quickly until receiv- 
ing the answer, or systematic guessing), and a related feature, 
Feature 7’, fast actions that do not involve gaming (which may 
indicate a very well-known skill). These features are computed 
using an automated detector of gaming the system (Baker, Corbett, 
et al., 2008). 

8. The student’s average probability of contextual slip/careless- 
ness on errors, making an error when the student is assessed to 
know the relevant skill (known to predict posttest problem-solving 
performance; Baker et al., 2010). This feature is computed using 
an automated detector (Baker et al., 2010). Also, a related feature, 
Feature 8’, the certainty of contextual slip, the average contextual 
slip computed only for values of contextual slip over 0.5; this 
represents how certain the model is when it indicates that a student 
has slipped. 

9. The student’s average learning-per-learning opportunity using 
the moment-by-moment learning model, which estimates the prob- 
ability that the student learned a relevant skill at each step in 
problem solving. Also, a related feature, Feature 9’, the degree to 
which there are spikes in learning, defined as the ratio between the 
maximum moment-by-moment learning and the average moment- 
by-moment learning. 

Many of these features involve a continuous variable, such as 
the time taken between actions or the probability of knowing a 
skill. In general, our detectors do not hinge on a student’s average 
value for the feature (e.g., average time between actions), but 
instead hinge on the proportion of actions that meet a constraint 
(e.g., the proportion of actions with a short pause, or the proportion 
of actions with a long pause). For each such feature, we empiri-
cally determined a cutoff value that indicates whether the student 
behavior occurred or not (e.g., a long pause or low probability), 
rather than averaging the actual values (times or probabilities), in 
order to avoid having a small proportion of extreme behaviors of 
interest be overwhelmed by noise in the rest of the student’s data. 
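
The difference between averaging raw values and counting the proportion of actions that meet a constraint can be shown with a small sketch. The helper names and all numbers below are hypothetical; the spikiness ratio follows the definition given for Feature 9’ (maximum divided by average moment-by-moment learning).

```python
def proportion_meeting(values, predicate):
    """Proportion of actions whose value satisfies the constraint."""
    values = list(values)
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)


def spikiness(learning_estimates):
    """Ratio of the maximum to the average moment-by-moment learning estimate."""
    average = sum(learning_estimates) / len(learning_estimates)
    return max(learning_estimates) / average


# Hypothetical pause lengths (in seconds) after one student's hint requests.
pauses_after_hint = [2.0, 9.5, 1.0, 12.0, 3.5]
long_pause_feature = proportion_meeting(pauses_after_hint, lambda s: s >= 8.0)
mean_pause = sum(pauses_after_hint) / len(pauses_after_hint)

# Hypothetical moment-by-moment learning estimates for the same student.
spike_feature = spikiness([0.01, 0.02, 0.30, 0.03, 0.04])

print(long_pause_feature, mean_pause, spike_feature)
```

Here two of the five pauses exceed the 8-s cutoff, so the feature value is a proportion rather than the mean pause length, which a single very long pause could otherwise dominate.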

Once feature engineering had been completed, a three-step
process was conducted to develop a model of transfer and PFL:
selecting features, optimizing feature cutoffs, and combining the
features into a unified prediction model. In order to select a set of
features, we fit a one-parameter linear regression model predicting
transfer from each feature (or related feature), using correlation as
the measure of each feature’s goodness. In order to increase the
probability of a generalizable model, we assessed each model’s
correlation using student-level leave-one-out cross-validation
(LOOCV). In this approach, a model is repeatedly fit for every
student except one, and then goodness of fit is tested on the left-out 
student. Every student is excluded from the training set and used 
as the test set exactly once. In this situation, each model fit can 
have either a positive or negative coefficient; therefore, the sign of 
a cross-validated correlation does not imply the direction of a 
relationship, but instead implies its consistency. A positive cross- 
validated correlation implies that the models generalize across the 
data, whereas a negative cross-validated correlation implies that 
the models fail to generalize across the data (and the relationship 
actually flips direction for a substantial number of students). Using 
cross-validation in this fashion is considered a valid alternative to
statistical significance testing (cf. Raftery, 1995); it explicitly
examines the goodness of the models on new data, rather than
investigating how well the model fits the data it is trained on
(Efron & Gong, 1983).
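
A minimal sketch of student-level LOOCV for a one-parameter regression is given below. It is an illustration under stated assumptions rather than the original analysis code: the function names and both small data sets are hypothetical, with one value per student.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for a one-feature linear model."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def loocv_correlation(xs, ys):
    """Fit on all students but one, predict the left-out student, then correlate."""
    predictions = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(slope * xs[i] + intercept)
    return pearson(predictions, ys)


# Hypothetical feature with a consistent (negative) relationship to test scores.
consistent = loocv_correlation([0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                               [0.90, 0.80, 0.75, 0.60, 0.50, 0.45])

# Hypothetical uninformative feature: every student has the same value.
uninformative = loocv_correlation([1.0] * 5, [0.2, 0.4, 0.6, 0.8, 1.0])

print(consistent, uninformative)
```

The first feature has a negative regression coefficient in every fold, yet its cross-validated correlation is positive, because the held-out predictions track the observed scores. The uninformative feature yields a negative cross-validated correlation, because each held-out prediction is simply the mean of the other students' scores.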


Transfer Detector 


Only features with positive cross-validated correlation to the 
transfer or PFL test were considered for inclusion in the full model. 

For the transfer detector, nine features met this criterion: Feature
1 (help avoidance), with a cutoff of 70% probability for “poorly
known”; Feature 2 (long pauses after a bug message), with a cutoff
of 7 s for “long”; Feature 2’ (short pauses after a bug message),
with a cutoff of 1.5 s for “short”; Feature 3 (long pauses after a
hint), with a cutoff of 8 s for “long”; Feature 4 (long pauses after
a hint and correct answer), with a cutoff of 12 s for “long”; Feature
6 (off-task behavior); Feature 7 (gaming the system); Feature 7’
(fast non-gaming actions), with a cutoff of 2 s for “fast”; and
Feature 9’ (spikiness in moment-by-moment learning).

Seven of these nine features depend on a threshold
parameter, N; adjusting a feature’s parameter can result in a very
different model. For each of these features, we used brute-force
grid search to find an optimal cutoff level (in grid search, values
are tried for every step at the same interval, for instance, 0.5 s,
1 s, 1.5 s, 2 s, etc.). Optimality was defined in terms of the
ability to predict the dependent variable, performance on the
transfer test. Variables involving probabilities were searched at a
grid size of 0.05; variables involving time were searched at a grid
size of 0.5 s.
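
The brute-force search over cutoffs can be sketched as follows. This is a hedged illustration: the function names, the pause data, and the test scores are hypothetical, and the goodness measure used here (the absolute in-sample correlation) is a simple stand-in for the predictive measure used in the study.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def proportion_long(pauses, cutoff):
    """Proportion of a student's pauses at or above the cutoff (in seconds)."""
    return sum(1 for p in pauses if p >= cutoff) / len(pauses)


def grid_search_cutoff(pauses_by_student, scores, low, high, step, goodness):
    """Try every cutoff on an evenly spaced grid and keep the best-scoring one."""
    students = sorted(pauses_by_student)
    outcomes = [scores[s] for s in students]
    n_steps = int(round((high - low) / step))
    grid = [low + k * step for k in range(n_steps + 1)]
    best_cutoff, best_value = None, float("-inf")
    for cutoff in grid:
        feature = [proportion_long(pauses_by_student[s], cutoff) for s in students]
        value = goodness(feature, outcomes)
        if value > best_value:
            best_cutoff, best_value = cutoff, value
    return best_cutoff, best_value


# Hypothetical pause lengths per student and hypothetical transfer scores.
pauses = {"s1": [1.0, 2.0, 9.0, 10.0], "s2": [1.5, 2.5, 3.0, 8.5],
          "s3": [0.5, 1.0, 1.5, 2.0], "s4": [7.5, 8.0, 9.5, 12.0]}
transfer = {"s1": 0.6, "s2": 0.8, "s3": 0.9, "s4": 0.3}

best_cutoff, best_value = grid_search_cutoff(
    pauses, transfer, low=0.5, high=12.0, step=0.5,
    goodness=lambda f, y: abs(pearson(f, y)))
print(best_cutoff, best_value)
```

The goodness function is passed in as an argument so that a cross-validated measure can be substituted without changing the search itself.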

The cross-validated correlations for single-feature regression 
models are shown in Table 1. 

These nine features were considered as potential candidates for 
a unified model (other features, which individually had cross- 
validated correlations below zero, were eliminated from consider- 
ation, as a control on overfitting). To find a unified model com- 
bining multiple parameters, Forward Selection was conducted 
(Ramsey & Schafer, 1997). In Forward Selection, the best single- 
parameter model is chosen, and then the parameter that most 
improves the model is repeatedly added until no more parameters 
can be added that improve the model. The goodness metric used 
was the LOOCV correlation between the predictions and each 
student’s performance on the transfer test. 
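
Forward Selection with a cross-validated goodness metric can be sketched as below. This is a simplified illustration, not the original analysis code: the data are synthetic (the outcome is constructed from two of the three features), the feature names are hypothetical, and the small `min_gain` threshold is an added assumption that guards against floating-point ties.

```python
from statistics import mean


def solve(matrix, rhs):
    """Solve a linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit(rows, ys):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va and vb else 0.0


def loocv_r(students, ys, chosen):
    """Leave-one-student-out correlation for a model using the chosen features."""
    rows = [[s[name] for name in chosen] for s in students]
    predictions = []
    for i in range(len(ys)):
        coefs = fit(rows[:i] + rows[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(coefs[0] + sum(c * x for c, x in zip(coefs[1:], rows[i])))
    return pearson(predictions, ys)


def forward_selection(students, ys, min_gain=1e-6):
    """Repeatedly add the feature that most improves the cross-validated r."""
    chosen, best = [], float("-inf")
    candidates = set(students[0])
    while candidates:
        scored = {name: loocv_r(students, ys, chosen + [name])
                  for name in sorted(candidates)}
        name = max(scored, key=scored.get)
        if scored[name] - best < min_gain:
            break
        chosen.append(name)
        best = scored[name]
        candidates.remove(name)
    return chosen, best


# Synthetic data: eight students, three candidate features.
raw = [(0.1, 0.3, 0.5), (0.2, 0.5, 0.1), (0.3, 0.2, 0.9), (0.4, 0.6, 0.3),
       (0.5, 0.4, 0.7), (0.6, 0.1, 0.2), (0.7, 0.5, 0.8), (0.8, 0.3, 0.4)]
students = [{"help_avoidance": a, "fast_not_gaming": b, "noise": c} for a, b, c in raw]
scores = [0.9 - 0.8 * a + 0.3 * b for a, b, _ in raw]

chosen, best = forward_selection(students, scores)
print(chosen, best)
```

Because the synthetic outcome depends only on the first two features, the procedure selects them and stops once adding the third feature no longer improves the cross-validated correlation.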

The resultant model was


Transfer = -1.5613 × HelpAvoidance (1)
+ 0.2968 × FastNotGaming (7’) + 0.8272.
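
As a worked illustration of how this equation yields a prediction, the sketch below applies the reported coefficients to two hypothetical students. The function name and the feature values are assumptions; each feature is taken to be the proportion of the student's actions that meets the feature's definition.

```python
def predict_transfer(help_avoidance, fast_not_gaming):
    """Apply the reported college transfer model to one student's feature values."""
    return -1.5613 * help_avoidance + 0.2968 * fast_not_gaming + 0.8272


# Hypothetical students: one rarely avoids help, one often does.
low_avoidance = predict_transfer(help_avoidance=0.10, fast_not_gaming=0.40)
high_avoidance = predict_transfer(help_avoidance=0.30, fast_not_gaming=0.40)
print(round(low_avoidance, 4), round(high_avoidance, 4))
```

With the other feature held constant, the student who avoids help more often receives the lower predicted transfer score, reflecting the negative coefficient on help avoidance.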


The feature most strongly associated with transfer, both by itself
and as a component of a unified model, was avoiding help, which
was negatively associated with transfer (cross-validated r = .376).
One potential interpretation is that help avoidance directly caused
lower learning (cf. Aleven et al., 2006), perhaps causing the
students to have less conceptual learning, as the tutor hints are
fairly conceptual in nature. This lack of conceptual understanding
may in turn have made these students less able to transfer their
knowledge. The other individual feature incorporated into the
model was fast nongaming actions. These actions were significantly
positively associated with transfer. Fast nongaming actions
may indicate a degree of fluency with the relevant skills that
facilitates reasoning with them, as hypothesized by Haverty,
Koedinger, Klahr, and Alibali (2000), leading to better transfer.

Table 1
Goodness of Single-Feature Linear Regression Models for Predicting Transfer in the College
Data Set

Feature                                         Transfer =                Cross-validated r
1. Help avoidance                               [illegible in source]     .376
9’. Spikiness of moment-by-moment learning      -9.758 × F9’ + 0.951      .346
4. Long pauses after reading hint messages
   and then getting the next action right       -6.510 × F4 + 0.893       .204
3. Long pauses after reading hint messages      -4.075 × F3 + 0.902       .199
7’. Fast actions that do not involve gaming     0.484 × F7’ + 0.726       .188
2. Long pauses after receiving bug messages     -13.497 × F2 + 0.880      .130
7. Gaming the system                            -0.2058 × F7 + 0.903      .076
2’. Short pauses after receiving bug messages   -4.291 × F2’ + 0.876      [illegible in source]
5. Off-task behavior                            -1.037 × F5 + 0.899       .024

The cross-validated correlation of the model to the transfer test 
was .396, as shown in Table 2. 


PFL Detector 


The same set of 18 features and model development process
described in the previous section was used to develop a model of
students’ PFL. In the case of PFL, 11 features showed positive
cross-validated correlations between the individual feature and the
students’ performance on the PFL test: Feature 1 (help avoidance),
with a cutoff of 85% probability for “poorly known”; Feature 3
(long pauses after a hint), with a cutoff of 8 s for “long”; Feature
3’ (short pauses after a hint), with a cutoff of 1 s for “short”;
Feature 4 (long pauses after a hint and correct answer), with a
cutoff of 8 s for “long”; Feature 4’ (short pauses after a hint and
correct answer), with a cutoff of 20 s for “short”; Feature 6
(off-task behavior); Feature 6’ (long pauses that are not off-task),
with a cutoff of 4 s for “long”; Feature 7 (gaming the system);
Feature 7’ (fast non-gaming actions), with a cutoff of 4 s for “fast”;
Feature 9 (average moment-by-moment learning); and Feature 9’
(spikiness in moment-by-moment learning).

Single-feature regression models fit on the whole data set and their
associated cross-validated correlations are shown in Table 3 (only
features with cross-validated correlation over zero are shown).

These 11 features were considered as potential candidates for a 
unified model. To find a unified model combining multiple parame- 
ters, Forward Selection was conducted, as with the transfer model. 

The resultant model was


PFL = 0.0127 × Spikiness (9’) - 0.5499 × HelpAvoidance (1)
- 5.3898 × LongPauseAfterHint (4) + 0.8773.


The feature most strongly associated with PFL was long pauses
after reading hint messages and getting the next action correct,
which was, somewhat unexpectedly, negatively associated with
PFL (cross-validated r = .410). As with transfer, help avoidance
was also negatively associated with PFL (cross-validated r =
.329) and entered into the final model. Finally, the spikiness of the
student’s learning was positively associated with PFL (cross-validated
r = .233) and entered into the final model. This
finding suggests that PFL is higher if a student’s learning more
frequently occurs in relatively sudden “aha” moments than if it
occurs more gradually, possibly indicating that deeper learning is
occurring.


Table 2
Cross-Validated Correlations Between Models and Tests

Construct    Data developed with    Data tested on    Cross-validated r
Transfer     College                College           .396
Transfer     College                High school       .426
Transfer     High school            High school       .528
PFL          College                College           .454
PFL          College                High school       .228
PFL          High school            High school       .181

Note. PFL = preparation for future learning.


As shown in Table 2, the overall cross-validated correlation of 
the model to the PFL test was .454. 


Transfer and PFL 


Given the existence of models that can predict PFL and transfer 
to a reasonable degree, one question is to what degree these two 
models are capturing the same construct. The two constructs have 
a fairly substantial correlation of .520. However, it is worth study- 
ing whether the two forms of robust learning are characterized by 
the same behaviors during learning. 

The results of these two models seem to suggest substantial 
overlap. First, several of the same data features were found to be 
associated with both transfer and PFL under cross-validation: 1, 3, 
4, 6, 7, 7’, and 9’. In fact, only two features predicted transfer but
failed to predict PFL, and only four features predicted PFL but 
failed to predict transfer. 
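
The overlap counts can be checked directly from the two feature lists given in the Transfer Detector and PFL Detector sections. The sketch below uses the feature numbering from the Feature Engineering list (in which off-task behavior is Feature 6); the variable names are illustrative.

```python
# Features with positive cross-validated correlation in the college data set,
# as listed in the text for each detector.
transfer_features = {"1", "2", "2'", "3", "4", "6", "7", "7'", "9'"}
pfl_features = {"1", "3", "3'", "4", "4'", "6", "6'", "7", "7'", "9", "9'"}

shared = transfer_features & pfl_features
transfer_only = transfer_features - pfl_features
pfl_only = pfl_features - transfer_features

print(sorted(shared), sorted(transfer_only), sorted(pfl_only))
```

Seven features are shared, two appear only in the transfer list, and four appear only in the PFL list, matching the counts stated above.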

In addition, each model was successful at predicting the other 
construct. When used to predict PFL, the optimized-feature trans- 
fer detector achieves a correlation of .425, almost as good as the 
optimized model trained to predict PFL. Correspondingly, when 
used to predict transfer, the optimized-feature PFL detector 
achieves a correlation of .395, almost identical to the detector 
trained just to predict transfer. 


Studying the Goodness of Transfer and PFL Detectors
for High School Data


After developing these detectors, our next goal was to understand how 
well these detectors transfer between different populations of students. To 
this end, data were analyzed for a sample of high school students working 
with the same Genetics Cognitive Tutor module to examine whether the 
robust learning models transfer between two populations who vary in age 
and prior preparation. 


Data Set 


As in the original study, the data used in the second study came 
from the Genetics Cognitive Tutor Three-Factor Cross module. 
Fifty-six high school students who were enrolled in high school 
biology courses used the tutor. The students were recruited to 
participate in the study for pay through several methods, including 
advertisements in a regional newspaper and recruitment handouts 
distributed at two urban high schools. 

The study had the same design as the college-level study.
Specifically, it consisted of two 2-hr sessions, followed by a shorter
session 1 week later, all conducted in computer clusters at Carnegie
Mellon University. The students engaged in Cognitive Tutor-supported
activities for 1 hr in each of two sessions. As in the
original study, students completed a transfer test and PFL test after
using the tutor, as well as completing a pretest and posttest of the
exact skills taught in the tutor. All tests were identical to the ones
used in the previous study.

The 56 students completed a total of 21,498 problem-solving 
attempts across a total of 9,204 problem steps in the tutor. The 
number of problem-solving attempts per student was not signifi- 
cantly different between the college and high school populations, 
t(126) = 0.847, p = .40. Like the college students, the high school 
students demonstrated successful learning in this tutor, with an
average pretest performance of 0.16 (SD = 0.09) and an average
posttest performance of 0.56 (SD = 0.28), a statistically significant
difference, t(55) = 11.443, p < .001. Students’ average transfer
test performance was 0.53 (SD = 0.22) and average PFL performance
was 0.66 (SD = 0.28).

Table 3
Goodness of Single-Feature Linear Regression Models for Predicting PFL in the College
Data Set

Feature                                         PFL =                     Cross-validated r
4. Long pauses after reading hint message(s)
   and then getting the next action right       -7.67 × F4 + 0.961        .410
3. Long pauses after reading hint messages      -5.050 × F3 + 0.956       .376
9. Average moment-by-moment learning            -8.240 × F9 + 0.979       .345
1. Help avoidance                               -1.118 × F1 + 0.952       .329
9’. Spikiness of moment-by-moment learning      0.022 × F9’ + 0.740       .233
4’. Short pauses after reading hint message(s)
   and then getting the next action right       -1.801 × F4’ + 0.937      .201
7’. Fast actions that do not involve gaming     0.350 × F7’ + 0.739       .187
5. Off-task behavior                            -1.089 × F5 + 0.944       .089
5’. Long pauses that are not off-task           -0.211 × F5’ + 0.976      .083
3’. Short pauses after reading hint messages    0.173 × F3’ + 0.886       .034
7. Gaming the system                            -0.134 × F7 + 0.93        .008

Note. PFL = preparation for future learning.


Transferring Robust Learning Detectors From College 
Students to High School Students 


To check the generalizability of the transfer and PFL detectors, 
we tested the predictive power of each detector, taking the detec- 
tors developed and optimized using the college data and applying 
them without modification to the high school data set. 

The college detector of transfer achieved a correlation of .426 to 
the transfer test scores within the high school data set. It is worth 
noting that this correlation was higher than the correlation (.396) in 
the college data set, despite the model being transferred to a new 
population. One possible explanation is that there is a closer link 
between in-tutor performance and transfer test performance in the 
high school population than the college population, potentially 
because students were closer to reaching the performance ceiling 
in the original college population. 

By contrast, the college detector of PFL achieved a correlation
of .228 to the PFL test scores within the high school data set, a
value that represents substantial degradation compared with the
data set for which this model was originally developed (where
the value was .454). At the same time, this correlation remains
marginally significantly greater than zero (p = .09).
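
The reported p value can be approximately reproduced with a short sketch. The text does not state which test was used; the code assumes the standard two-tailed t test for a Pearson correlation with n = 56 students, and it integrates the t density numerically so that only the standard library is needed. The function names are illustrative.

```python
import math


def t_density(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def correlation_p_value(r, n, steps=2000):
    """Two-tailed p value for H0: correlation = 0, using t = r * sqrt(df / (1 - r^2))."""
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    # Composite Simpson's rule for the area under the density between 0 and t.
    h = t / steps
    area = t_density(0.0, df) + t_density(t, df)
    for k in range(1, steps):
        area += (4 if k % 2 else 2) * t_density(k * h, df)
    area *= h / 3
    return t, 1 - 2 * area


t_pfl, p_pfl = correlation_p_value(0.228, 56)
t_transfer, p_transfer = correlation_p_value(0.426, 56)
print(round(t_pfl, 2), round(p_pfl, 3), round(t_transfer, 2), round(p_transfer, 4))
```

Under this assumption, the PFL correlation of .228 gives a p value between .05 and .10, in line with the marginal result reported above, whereas the transfer correlation of .426 is well below .01.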


Building New Robust Learning Detectors 
for High School Students 


In order to fully understand the degree of degradation between 
the college and high school populations, we can build new detec- 
tors for the high school population. Seeing how well these detec- 
tors perform can give us an upper limit for how well this type of 
detector can perform in this data set. It also may be interesting to 
study which data features are important predictors within the high 
school population, to see how these features differ from those used 
in the college population, at a qualitative level. 


A new detector of transfer trained on the data from the high school
population using optimized features achieves a cross-validated
correlation of .528. This number is moderately higher than the goodness of
the detector trained on the college population and then applied to this
data set, which was .426. It is also higher than the goodness of the
detector trained on the college population on its original data set,
which was .396, again indicating that student behavior is more closely
linked to performance on the transfer test in the high school
population than in the college population.

By contrast, a new detector of PFL trained on the data from the
high school population using optimized features achieves an
unimpressive cross-validated correlation of .181. This number is actually
lower than the goodness of the detector trained on the college
population and then applied to this data set, which was .228. It is also
substantially lower than the goodness of the detector trained on the
college population on its original data set, which was .454. This
result indicates that the behaviors associated with PFL in this new
population are not captured well by the feature set originally
developed within the college population.


Features Associated With Robust Learning 
in High School Data Set: Transfer 


Within the high school data set, 13 individual features were 
found to have positive cross-validated correlation to the transfer 
test scores. The single-feature linear regression model for each 
feature is given in Table 4. 

There was substantial overlap between the features that had positive
cross-validated correlations in the college and high school populations.
Only one of the features that had a positive cross-validated correlation for
the college population failed to have a positive cross-validated correlation
for the high school population, short pauses after bug messages (Feature
2’). Of the remaining features, all but two pointed in the same direction in
both data sets (pointing in the same direction means that the model
coefficient was either negative in both data sets or positive in both data
sets). The two that changed direction were the spikiness of moment-by-moment
learning (negative in the college data set and positive in the high
school data set) and off-task behavior (negative in the college data set and
positive in the high school data set). It is worth noting that off-task
behavior had the weakest relationship that still had a positive
cross-validated correlation, in both data sets (.024 and .051). Hence, the primary
noteworthy difference is the relationship for spikiness.

Table 4
Goodness of Optimized Single-Feature Linear Regression Models at Predicting Transfer in High School Data Set

Feature                                         Transfer =                Cross-validated r
7. Gaming the system                            -0.9108 × F7 + 0.8482     .496
9. Average moment-by-moment learning            -16.6448 × F9 + 0.906     .490
7’. Fast actions that do not involve gaming     0.8805 × F7’ + 0.0374     .437
8. Average contextual slip                      1.4064 × F8 + 0.0226      .429
8’. Certainty of slip                           0.8412 × F8’ + 0.2947     .409
3’. Short pauses after reading hint messages    [illegible in source]     .396
3. Long pauses after reading hint messages      [illegible in source]     .391
1. Help avoidance                               -1.6946 × F1 + 0.7475     .386
4. Long pauses after reading hint message(s)
   and then getting the next action right       -1.5936 × F4 + 0.6321     .367
9’. Spikiness of moment-by-moment learning      0.0598 × F9’ + 0.2722     .362
4’. Short pauses after reading hint message(s)
   and then getting the next action right       -1.3071 × F4’ + 0.61      .350
2. Long pauses after bug messages               -43.8096 × F2 + 0.5588    .200
5. Off-task behavior                            1.7228 × F5 + 0.4554      .051

That said, it is worth noting that many of the features changed 
semantics substantially during parameter optimization. Only one 
feature retained similar semantics between the two data sets, help 
avoidance (Feature 1), which had an optimized cutoff of 70% in 
the college data set, but an optimized cutoff of 50% in the high 
school data set, a relatively minor change. In terms of features that 
changed semantics, Feature 3, long pauses after reading help 
messages, changed from a cutoff of 8 s in the college data set to 1 
s in the high school data set, a substantially different feature. 
Similarly, Feature 4, long pauses after reading help messages and 
then obtaining a correct answer, changed from 12 s in the college 
data set to 1 s in the high school data set. Feature 7’, fast 
non-gaming actions, shifted in the other direction, from 2 s to 20 s. 

Five additional features were also significant in the high school
model: Feature 3’ (short pauses after a hint), with a cutoff of 17 s
for “short”; Feature 4’ (short pauses after a hint and correct
answer), with a cutoff of 17 s for “short”; Feature 8 (average
contextual slip); Feature 8’ (certainty of contextual slip); and
Feature 9 (average moment-by-moment learning).

A model was fit using Forward Selection, as in the college data 
set. The best model of transfer for the high school data set, using 
the optimal feature cutoffs, and fitting to all data, was as follows: 


Transfer = -0.793 × Gaming (7) + 1.518 × Off-task behavior (6)
- 34.429 × LongPauseAfterBug (2) + 0.7587.


Features Associated With Robust Learning 
in High School Data Set: PFL 


A range of variables were found to have cross-validated corre- 
lations over zero to the PFL test within the high school population, 
shown in Table 5. There was considerable overlap between the 
college and high school populations for these features. Seven of 
the 11 features used in the college detector of PFL were also used 
in the high school detector of PFL (Feature 3, Feature 3’, Feature 
4, Feature 7, Feature 7’, Feature 9, Feature 9’), with all pointing in 
the same direction in the two data sets except for Feature 3’, which 
switched direction. 

However, none of these features had particularly impressive
correlations taken individually, with the highest cross-validated
correlation for the high school data set having a value of .137. This
feature was Feature 9’, the spikiness of the moment-by-moment
learning model. Two other features had cross-validated correlations
of .1 or higher: the certainty of slip and gaming the system.
Spikiness and gaming were also found in the college PFL model,
where the relationships pointed in the same direction as in the high
school data set.

Table 5
Goodness of Optimized Single-Feature Linear Regression Models at Predicting PFL in High School Data Set

Feature                                         PFL =                     Cross-validated r
9’. Spikiness of moment-by-moment learning      0.045 × F9’ + 0.4622      .137
8’. Certainty of slip                           0.5802 × F8’ + 0.4941     .123
7. Gaming the system                            -0.5002 × F7 + 0.8316     .105
3. Long pauses after reading hint messages      -1.637 × F3 + 0.752       .097
9. Average moment-by-moment learning            -9.195 × F9 + 0.865       .092
4. Long pauses after reading hint message(s)
   and then getting the next action right       -2.3075 × F4 + 0.7452     .073
2. Long pauses after bug messages               -30.6071 × F2 + 0.6819    .059
3’. Short pauses after reading hint messages    -0.744 × F3’ + 0.7193     .049
8. Average contextual slip                      0.7828 × F8 + 0.3744      .045
7’. Fast actions that do not involve gaming     0.4773 × F7’ + 0.3899     .041

Note. PFL = preparation for future learning.

A model of PFL was fit using Forward Selection, as in the 
college data set. The best model of PFL for the high school data 
set, using the optimal feature cutoffs, and fitting to all data, was as 
follows: 


PFL = 0.028 × Spikiness (9’) - 1.1901 × LongPauseAfterHint (3)
- 27.343 × LongPauseAfterBug (2) + 0.6214.


Conclusions 


In this article, we have studied the degree to which automated 
detectors of transfer and PFL transfer to a new cohort of students, 
using the same tutor lesson. These findings establish that it is not 
just possible to identify whether a student has achieved robust 
learning; it is also possible to successfully apply these models on 
a different population than the initial population these detectors 
were developed for, establishing that there is some degree of 
generality in the constructs that these detectors tap. 

The detector of transfer generalized from the college population 
to the high school population with limited evidence of degradation; 
in fact, the detector functioned better within the new population 
than in the original population, though not quite as well as a new 
detector trained specifically for the new population. 

The detector of PFL, however, saw relatively greater evidence 
of degradation between the college and high school populations,
achieving a correlation only about half as high within the high 
school population as had been achieved within the college popu- 
lation. However, it may just be that PFL was relatively difficult to 
detect within the high school population, as a detector trained 
specifically for the new population also functioned relatively 
poorly. 

Between the high school and college populations, many of the 
same features were predictive of transfer and PFL. There was 
substantial overlap in both cases, with seven of nine features that 
had cross-validated correlation over zero in the college data set 
achieving a cross-validated correlation over zero and a coefficient 
pointing in the same direction as in the college model, when 
transferred to the high school data set. Six of 11 features achieved 
this same standard when the college model of PFL was transferred 
to the high school data set, a lower degree of overlap but still an
indication of considerable similarity between the construct in the 
two data sets. 

Four features were predictive (and pointed in the same direc- 
tion) in every model: Features 3, 4, 7, and 7’. Feature 3, long 
pauses after reading hint messages, and Feature 4, long pauses 
after reading hint messages and providing a correct answer, were 
negatively correlated with robust learning for each construct and 
data set. This does not necessarily mean that these pauses (inter- 
preted as implying self-explanation; cf. Shih et al., 2008) actually 
hurt learning, but may instead indicate a general selection bias 
where the students who seek help are generally less knowledgeable 
(cf. Aleven et al., 2006). These results build on past findings 
regarding relationships between students’ strategies for using help 
and their learning outcomes (cf. Aleven et al., 2006). We recommend
that future research on help seeking and learning consider
measures of transfer and PFL to a greater degree. 

Feature 7, gaming the system, was also negatively correlated 
with robust learning for each construct and data set, albeit with 
relatively low correlations. This finding accords with previous 
results suggesting that gaming the system is particularly pernicious 
for learning (cf. Cocea, Hershkovitz, & Baker, 2009). 

However, fast nongaming actions were positively correlated 
with robust learning for each construct and data set, with generally 
strong correlations. These actions appear to indicate robust learn- 
ing that leads to both transfer and PFL. Given that fast correct 
actions are also associated with retention (cf. Pavlik & Anderson, 
2008), it appears that rapid correct performance indicates learning 
that is robust in multiple fashions. 

Many other features were associated with robust learning for a 
single construct. Help avoidance was associated with transfer with 
a strong negative correlation in both populations. Previous analysis 
has also revealed negative correlations between help avoidance 
and learning; for example, students who make errors when they 
should have sought help perform more poorly on tests of standard 
problem solving (Aleven et al., 2006). Help in the Genetics Cog- 
nitive Tutor is fairly conceptual in nature; that is, it relates the steps 
in the problem-solving procedure to the properties of the underly- 
ing genetic processes. Our findings suggest that this type of help is 
associated not just with learning to solve the types of problems in 
the tutor, but leads to robust learning as well. Prior work studying 
the learning impact of teaching students when to seek help has not 
had significant effects on problem-solving posttests (Roll, Aleven, 
McLaren, & Koedinger, 2011); it would be worth studying 
whether this type of meta-cognitive instruction impacts perfor- 
mance on measures of robust learning, even if it does not impact 
performance on problem-solving posttests. An alternate explana- 
tion for the negative relationship between help avoidance and 
robust learning in our study—in line with the results in Roll et al. 
(2011)—is that some students are not prepared to learn from the 
types of help in the tutor, leading them to both avoid help and 
demonstrate less robust learning. In general, further attention to 
why students avoid help and how students use help successfully 
and unsuccessfully (cf. Aleven, Stahl, Schworm, Fischer, & Wal- 
lace, 2003) may help us better understand this finding. 

Features of the moment-by-moment learning model were asso- 
ciated with PFL in both data sets. In particular, spikiness as 
measured by the moment-by-moment learning model was posi- 
tively associated with PFL in both data sets. In other recent work, 
researchers have suggested that further distillations of the moment- 
by-moment graph, in particular through explicitly considering the 
visual form of the graph, can be even more predictive of student 
PFL (Baker, Hershkovitz, Rossi, Goldstein, & Gowda, in press).

In general, this article suggests that models of robust learning can 
be transferred to new populations. As such, these models can be used 
with relative confidence for new groups of students, to drive inter- 
ventions. By doing so, we can move toward the vision of learning 
systems that can adapt effectively to individual differences not just in 
what students know, but in how robust their learning is. 

Another valuable area of future work will be to determine how 
general the phenomena seen here are for new content: new lessons 
within the Genetics Tutor, Cognitive Tutors on other topics, and 
additional learning systems. The models presented here are time- 
consuming in nature to develop; to the extent that general models can 
be developed, their potential usefulness will be substantially
increased.

We provide evidence of persistent gender effects for students using advanced adaptive technology while
learning mathematics. This technology improves each gender’s learning and affective predispositions 
toward mathematics, but specific features in the software help either female or male students. Gender 
differences were seen in the students’ style of use of the system, motivational goals, affective needs, and 
cognitive/affective benefits, as well as the impact of affective interventions involving pedagogical agents.
We describe 4 studies, with hundreds of students in public schools over several years, which suggest that 
technology responses should probably be customized to each gender. This article shows differential 
results before, during, and after the use of adaptive tutoring software, indicating that digital tutoring 
systems can be an important supplement to mathematics classrooms but that male and female students 
should be addressed differently. Female students were more receptive than male students to seeking and 
accepting help provided by the tutoring system and to spending time seeing the hints; thus, they had a 
consistent general trend to benefit more from it, especially when affective learning companions were 
present. In addition, female students expressed positively valenced emotions most often and exhibited 
more productive behaviors when exposed to female characters; these affective pedagogical agents 
encouraged effort and perseverance. This was not the case for male students, who had more positive 
outcomes when no learning companion was present and their worst affective and cognitive outcomes 
when the female character was present. 


Keywords: adaptive learning environments, gender differences, affective agents, motivation and affect,
quantitative analysis


Advanced learning technologies have the potential to person- 
alize instruction for students and to meet individual learning 
needs. To provide such personalized instruction, researchers 
must first assess factors that influence student learning. A key 
factor in the domain of mathematics is student gender. To date, 
much of the educational psychology research on motivation and 
achievement has been conducted in standard classrooms with- 


Literature Review of Gender Differences and
Performance


A wealth of educational psychology research on motivation and
achievement has documented the development of gender differences
in mathematics education, in both the affective and the
cognitive domains. This section summarizes the literature on
these two aspects.

In relation to achievement, girls generally tend to have higher 
grades in math classes, but boys tend to score higher than girls in 
standardized tests (Hyde, Lindberg, Linn, Ellis, & Williams, 
2008). Female students tend to enroll in less advanced math 
classes, although this trend has been diminishing. Various cognitive
factors have been held responsible for the difference in perfor- 
mance in math tests such as SAT Math and National Assessment 
of Educational Progress. Among them are gender differences in 
basic spatial abilities such as mental rotations (Casey, Nuttall, 
Pezaris, & Benbow, 1995), which are heavily involved not only in 
highly visual (e.g., geometry) math problems but also in mathe- 
matics problems that require approximate solutions and estima- 
tions of magnitude. Differences in verbal components such as the 
speed and accuracy of retrieval of basic arithmetic facts from 
long-term memory into working memory (Royer, Tronsky, Chan, 
Jackson, & Marchant, 1999) are another possibility,¹ because
phonological areas of the brain are heavily involved in exact 
solutions to mathematics problems and arithmetic (Dehaene, 
Spelke, Pinel, Stanescu, & Tsivkin, 1999). Although both nature
and nurture may contribute to these differences, evidence has
suggested that nurture plays a large role. For instance, studies
have shown that action video games can
virtually eliminate a gender difference in spatial attention (in only 
10 hours of play) and significantly decrease the gender disparity in 
mental rotation ability (Feng, Spence, & Pratt, 2007). Similarly, 
studies have shown how basic math facts retrieval can be trained 
for both speed and accuracy and that this in turn improves math- 
ematics problem-solving ability (Arroyo, Royer, & Woolf, 2011; 
Royer et al., 1999). Thus, a clear possibility is that female and 
male students are not exposed to activities that develop these two 
core areas related to mathematics cognition in equal ways, as 
children grow up, either inside or outside of school. 

The affective component of gender differences in mathematics 
is related to the fact that, as they progress throughout the K-12 
school system, girls report increasingly more negative attitudes 
toward mathematics and express more self-derogating attributions 
about their mathematics performance (Hyde et al., 2008; Royer & 
Walles, 2007). In particular, gender differences have been found in 
early adolescence for mathematics self-concept (belief about one’s 
ability to learn mathematics) and mathematics utility (belief that 
mathematics is important; Eccles, Wigfield, Harold, & Blumen- 
feld, 1993). Additionally, both female and minority students de- 
velop more negative feelings toward mathematics during their 
school years than do the rest of students (Catsambis, 2005), al- 
though there are important racial-ethnic differences, with affective 
gender differences being most pronounced among Latino students. 

It is possible that these affective differences are either a cause or
a consequence of the gender difference in mathematics performance
in standardized tests. Stereotype threat, or the concern that
others will view one stereotypically (Spencer, Steele, & Quinn,
1999), has been identified as accounting for gender differences in
mathematical problem solving, suggesting that female students
might receive messages (from peers, parents, teachers, or the
media) about supposed male superiority in math ability or about
female inferiority.

However, it is important to note that one of the largest studies 
(Catsambis, 2005) involving thousands of students showed there is 
a trend for all students to decrease their interest in mathematics as 
they progress throughout the school system and increase their 
perception of the difficulty of mathematics, in contrast to other 
school subjects. Thus, generating positive experiences in mathe- 
matics learning for all students, but for female and minority 
students in particular, should be an important goal of mathematics 
education research and practice. 


The Impact of School and Educational Practices on 
Gender Development 


The school functions as a primary setting for developing gender 
orientations. Studies over the past several decades have indicated that
teachers used to pay more attention to boys and to interact with
them more extensively (Ebbeck, 1984; Fennema, Carpenter, Jacobs,
Franke, & Levi, 1998; Forgasz & Leder, 2006). Boys used to
receive more praise as well as criticism from teachers in the 
classroom than did girls (Cherry, 1975), and they were more likely 
to be praised for academic success and criticized for misbehavior. 
Girls tended to be praised for tidiness and compliance and criti- 
cized for academic failure, which could undermine their perceived 
self-efficacy (Eccles, 1987). On the other hand, when teachers 
emphasized the usefulness of quantitative skills and encouraged 
cooperative or individualized rather than competitive learning, 
female students showed higher perceived efficacy and valuation of 
mathematics (Eccles, 1989). 

Gender differences have been observed in how male versus 
female students judge their math capabilities with impact on math 
course selection (Eccles, 1987; Hyde & Linn, 1988). Until some 
years ago, female students were known to enroll in significantly 
fewer higher level mathematics, science, and computer courses; to 
have less interest in these subjects; and to view these courses as less
useful than did their male counterparts. For decades, school coun- 
selors have encouraged and supported the interest of male students 
in scientific fields and have steered female students away from 
scientific and technical fields (Fitzgerald & Crites, 1980). Many 
efforts have been made in the past decades to change such prac- 
tices (National Council of Teachers of Mathematics, 2008; Sevo & 
Chubin, 2010). Meanwhile, cross-cultural studies have found that 
gender equity in school enrollment, women’s share of research 
jobs, and women’s parliamentary representation were the most 
powerful predictors of cross-national variability in gender gaps in 
math (Else-Quest, Hyde, & Linn, 2010). 


The Impact of Peers on Gender Development 


As a child’s social world expands outside the home, peer groups 
become another agency of gender development. The peer group is 
often thought of as the primary socializing agency of gender 
development and differentiation (Leaper, 1994). Peers are both the
product and the contributing producers of gender differentiation
(Bandura, 1986). They instate gender differentiation by favoring
same-gender playmates and making sure that their peers conform
to the conduct expected of their gender.

¹ Single- and double-digit addition, subtraction, multiplication, and
division.

Vigilance is needed to ensure that female teachers do not pass 
on their own anxieties and stereotypes to their female students 
(Beilock, Gunderson, Ramirez, & Levine, 2010) and that equitable 
classroom-level interactions are maintained (Boaler, 1997). Stud- 
ies have shown that cooperative learning has been extremely 
effective in mathematics education for female students (Slavin, 
1990; Slavin, Lake, & Groff, 2009). Still, answers from male 
students in a cooperative setting often prevail over those from 
female students, especially during dissension episodes (Wilkinson, 
Lindow, & Chiang, 1985). 

One way that has been found to create positive experiences for 
female students is through personalized instructional practices. In 
one study, students who perceived teachers to be interested in them 
as individuals (personalization), who placed emphasis on investi- 
gative skills (investigation), and who felt classroom participation 
was important were more likely to believe in themselves as capa- 
ble learners of mathematics (Forgasz & Leder, 2006). This per- 
sonalization was more critical for female than for male students. 


Strategic Mathematical Ability and Performance 


As children mature, they proceed through a number of stages in 
their mathematical development. For example, they shift from 
physical (e.g., finger counting) to cognitive representations of 
numbers and operations, which decreases the load on working 
memory (Case et al., 1996). The acquisition of a mental “number 
line” and an increase in working memory capacity support a rich 
and abstract cognitive representation of numbers (Baroody,
Tiilikainen, & Tai, 2006). Children increasingly develop
metacognitive abilities that enable them to know when, why, and how
to use new mathematical strategies that provide flexibility in using 
the problem-solving and meta-strategic knowledge strategies they 
possess (Carr, Alexander, & Folds-Bennett, 1994; Montague & 
Jitendra, 2006). 


Self-Regulation and Mathematics Performance 


Self-regulated learners (those who take control of and evaluate 
their own learning and behavior) often hold incremental beliefs 
about intelligence (as opposed to fixed views of intelligence) and 
attribute their successes/failures to factors (e.g., effort expended, a 
particular strategy) within their control (Dweck & Leggett, 1988). 
Effective learners often use metacognition (thinking about one’s 
thinking) and strategic action (planning, monitoring, and evaluat- 
ing). They are highly motivated and have a high sense of self- 
efficacy (Butler & Winne, 1995; Perry, Phillips, & Hutchinson, 
2006; Pintrich & Schunk, 2002; Winne & Perry, 2000), and they 
often exhibit success in and beyond school (Corno et al., 2002; 
Winne & Perry, 2000). Any weakness in a student’s regulation of 
any of these areas can produce a less than optimal use of learning 
software. An example is “gaming the system” (Baker, Corbett, 
Koedinger, & Wagner, 2004), in which students perform actions to 
obtain the answer without seriously trying to learn the underlying 
concepts. 


Theoretical Foundation for a Technology Environment 


In this article, we describe Wayang Outpost, a multimedia-based 
intelligent tutoring system that is the test bed for this research 
(Woolf, 2009). It provides a broad range of pedagogical support 
while students solve mathematics problems of the type that commonly
appear on standardized tests (see Figure 1; Arroyo, Beal, Murray,
Walles, & Woolf, 2004). Developed at the University of
Massachusetts Amherst, the Wayang Tutor supports strategic and
problem-solving abilities based on the theory of cognitive
apprenticeship (Collins, Brown, & Newman, 1989) in which a master
teaches skills to an apprentice. In this case, the expert is the
computer program that assists students to learn tacit processes
while the program models a solution, provides practice
opportunities with the availability of scaffolded strategies based on the
multimedia learning theory (Mayer, 2001), and provides metacognitive
scaffolds, such as stopping to reflect on student progress.


Figure 1. Wayang Outpost. In Wayang Outpost, the male learning companion uses gestures to offer advice and
encouragement. Students ask for hints or to see the solution, animations, or videos.
[Screenshot of a sample problem: “Dion wants to earn a minimum quiz average of 92% in his
biology course. His grades so far are 89%, 95%, and 85%. Which inequality below represents the
possible scores s for his next quiz which will allow Dion to achieve his goal?”]


Teaching Strategies in a Technology Environment 


Wayang Outpost is particularly strong at coaching and scaffold- 
ing. It provides synchronized sound, animations, and videos that 
show instructors solving problems and graphic pencils to support 
student drawings (see Figure 1). A big part of cognitive appren- 
ticeship is to challenge students by providing slightly more diffi- 
cult problems than the learner/apprentice could accomplish by 
herself. Vygotsky (1978) referred to this as the zone of proximal 
development and suggested that fostering development within this 
zone led to the most rapid learning. The software provides adaptive 
selection of problems with increased/decreased difficulty depend- 
ing on recent student success and effort (Arroyo, Cooper, Burle- 
son, & Woolf, 2010; Corbett & Anderson, 1995). 
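
The idea of adjusting problem difficulty according to recent success and effort can be sketched as follows. This is only an assumed illustration: the difficulty scale, the step size, the update rule, and the function names are hypothetical and are not the selection policy implemented in Wayang Outpost, which is described in the works cited above.

```python
def next_difficulty(current, recent_correct, recent_effort, step=0.1):
    """Return a target difficulty in [0, 1] for the next problem.

    Assumed rule: raise difficulty after an effortful success, lower it
    after an effortful failure, and keep it unchanged when the student
    did not exert effort (the outcome is then uninformative).
    """
    if recent_effort and recent_correct:
        target = current + step
    elif recent_effort and not recent_correct:
        target = current - step
    else:
        target = current
    return min(1.0, max(0.0, target))


def pick_problem(problems, target):
    """Choose the problem whose difficulty is closest to the target."""
    return min(problems, key=lambda p: abs(p["difficulty"] - target))


# Hypothetical problem pool.
pool = [
    {"id": "p1", "difficulty": 0.2},
    {"id": "p2", "difficulty": 0.5},
    {"id": "p3", "difficulty": 0.8},
]
chosen = pick_problem(pool, next_difficulty(0.5, True, True))
```

Keeping the target slightly above what the student has just mastered is one simple way to stay within the zone of proximal development described above.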


Affective Interventions in a Technology Environment 


Students’ affective states and traits (e.g., frustration, boredom) 
can bias the outcome of any learning situation, whether human- or 
computer-based. Student emotions within a traditional classroom 
have been described as control- or value-oriented (Pekrun, 2006; 
Pekrun, Frenzel, Goetz, & Perry, 2007). This control-value theory 
is based on the premise that student appraisals of control and 
values are central to the arousal of achievement emotions, includ- 
ing activity-related emotions such as enjoyment, frustration, and 
boredom experienced while learning, as well as outcome emotions 
such as joy, hope, pride, anxiety, hopelessness, shame, and anger. 
Students often use coping strategies to regulate their emotions in 
stressful learning situations (e.g., avoidance, humor and accep- 
tance, and negation; Eynde, de Corte, & Verschaffel, 2007). 

Given the importance of affect during learning, there have been 
various efforts related to designing models that can automatically 
recognize student affect (Conati & Maclaren, 2009; D’Mello & 
Graesser, 2012a, 2012b; Muldner, Burleson, & VanLehn, 2010). 
In our prior work, a linear regression model was used to predict 
student emotion based on recent student behavior in the system; 
physical sensors (camera, seat cushion, etc.) were also used to
help accurately predict students’ self-reports of emotions, which
were collected within the software every 5 minutes, but only after
a problem was complete
(Arroyo, Cooper, et al., 2009; Arroyo, Woolf, Royer, & Tai, 2009; 
Cooper, Arroyo, & Woolf, 2011; Cooper et al., 2009). This affec- 
tive model based on sensors was not used for any of the current 
studies in this article, which relied on students’ self-reports of their 
emotions. 
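
The timing rule for self-reports (every 5 minutes, but only once a problem is complete) can be illustrated with a small scheduler. The class and method names are hypothetical, and timestamps are assumed to be seconds since the start of the session; this is a sketch of the rule, not the tutor's code.

```python
REPORT_INTERVAL_SECONDS = 5 * 60


class SelfReportScheduler:
    """Decide when to ask a student for an emotion self-report."""

    def __init__(self, start_time=0.0):
        # Time of the last prompt (or of session start).
        self.last_prompt = start_time

    def on_problem_completed(self, now):
        """Call when a problem is finished; True means prompt now.

        Prompts are never shown mid-problem because this method is
        only invoked at problem boundaries.
        """
        if now - self.last_prompt >= REPORT_INTERVAL_SECONDS:
            self.last_prompt = now
            return True
        return False
```

Under this rule the gap between two prompts is at least 5 minutes but can be longer when a student spends a long time on a single problem.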

The presence of someone who cares, or at least appears to care, 
can make a student’s experience more personal and help that 
student persist at a task. Brain signals imitate feelings in the body 
of a listener (Rapson, Hatfield, & Cacioppo, 1994); thus, a student
might register joy or sadness from someone nearby exhibiting
those emotions. Empathic responses might work when students do 
not feel positive about the learning experience (McQuiggan, 
Rowe, & Lester, 2008). Thus, a computer persona that appears to 
enjoy math experiences could transmit positive experiences to 
students. Same-gender virtual characters are likely to be more 
effective as confidants based on research that same-gender friends 
are more often confidants (Reisman, 1990) and that teenagers are 
more intimate in same-gender friendships (Aukett, Ritchie, & Mill, 
1988). 

Gendered and multicultural companions in Wayang Outpost act 
like peers/study partners who care about a student’s progress and 
offer support and advice (see Figure 2; Arroyo, Cooper, et al., 
2009; Arroyo, Woolf, et al., 2009). Companions were designed to 
appear unimpressed or to simply ignore students’ solutions when 
students did not exert effort; companions praised students who 
exerted effort, even if the answers were wrong. Affective charac- 
ters in Wayang Outpost implemented Dweck’s theory of motiva- 
tion and praise (Dweck, 1999, 2002a, 2002b), which holds that 
students who view their intelligence as fixed and immutable (trait- 
based) tend to shy away from academic challenges, whereas stu- 
dents who believe that intelligence can be increased through effort 
and persistence (state-based) tend to seek out academic challenges. 
Praise, when delivered appropriately, can encourage students to 
view their intelligence as malleable and can support stable self- 
esteem regardless of how hard the students work. In contrast, 
stakeholders (e.g., teachers and parents) may lead students to 
accept a trait-based view of intelligence by praising intelligence, 
rather than effort, thus implying that success and failure depend on 
something beyond the students’ control. Dweck’s recommenda- 
tions were implemented in the messages delivered by Wayang’s 
companions. Tables 1 and 2 present a few of approximately 50 
spoken messages used to motivate students and provide meta- 
cognitive help for effective problem-solving strategies. The com- 
panions speak the messages at the beginning of new problems 
and/or after problem-solving actions. 
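
The effort-over-success principle can be pictured as a lookup keyed on answer correctness and effort. In the sketch below the four messages are the sample messages from Table 1; the function name and the assumption that effort is already available as a yes/no judgment are hypothetical, since how the system estimates effort is not described in this section.

```python
# Sample messages from Table 1, keyed by (answer is correct, effort is high).
MESSAGES = {
    (False, False): (
        "We kind of rushed to answer that one. Shall we ask the computer "
        "for help? I am sure we will get it if we take the time to solve "
        "the problem."
    ),
    (False, True): (
        "These are the hard questions that I like. There is an opportunity "
        "to learn. Let's click on the help button."
    ),
    (True, False): (
        "That was good, however, I prefer harder questions so that we "
        "learn from the help that the computer gives, even if we get "
        "them wrong."
    ),
    (True, True): "Hey, congratulations! Your effort paid off, you got it right.",
}


def companion_message(correct, high_effort):
    """Pick a companion message that responds to effort, not just success."""
    return MESSAGES[(bool(correct), bool(high_effort))]
```

Note that a correct answer given with low effort does not earn praise, whereas an incorrect answer given with high effort is framed as an opportunity to learn.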


General Method 


We investigated specific pedagogical approaches within Wayang
through a series of studies conducted over the course of 10
years. Our goal was to improve both student learning and affective
outcomes. The conditions in these studies varied, such as the way
that help was provided or the presence of an animated character
(with randomized assignment of students to conditions). However,
the mathematics content remained constant, including the range of
topics and other help aids (e.g., read-aloud, worked-out examples,
tutorial videos).


Figure 2. Gendered learning companions: Jane, the female affective
learning companion, and Jake, the male affective learning companion.
Characters of different racial-ethnic backgrounds were created after these
studies. Jane and Jake are the two characters that were part of the studies
reported in this article.
[Panels: Jane, the female character (female voice); Jake, the male character (male voice).]


Table 1

Sample Messages Given by Learning Companions That Emphasize Effort Over Success, When Students Solve Math Problems

Answer type: Incorrect
  Low effort: “We kind of rushed to answer that one. Shall we ask the computer
    for help? I am sure we will get it if we take the time to solve
    the problem.”
  High effort: “These are the hard questions that I like. There is an opportunity
    to learn. Let’s click on the help button.”

Answer type: Correct
  Low effort: “That was good, however, I prefer harder questions so that we
    learn from the help that the computer gives, even if we get
    them wrong.”
  High effort: “Hey, congratulations! Your effort paid off, you got it right.”


Subjects and Procedure


Students within several mathematics classes at a variety of
public schools in Massachusetts interacted with Wayang Outpost
for approximately four 1-hr sessions during normal mathematics
classes, during the span of 1 week. Students completed a pretest
and posttest assessing their mathematics knowledge through a
series of questions (approximately 15 question items per test) prior
to and shortly after using Wayang, respectively. Wayang logged
all student interface actions for all studies. The test varied per
study depending on the knowledge units taught, which were
selected depending on grade, school level, and topics the teacher
chose to cover.


Instruments


Students’ mathematics knowledge and problem-solving ability
were assessed with instruments drawn from the Massachusetts
state-based test and SAT-Math tests. Each test was composed of 15
items that were representative of knowledge units and math skills
taught by Wayang Outpost. The two tests (A and B) were
counterbalanced and randomly assigned, so that half the students
received Test A as a pretest and half received Test B (and then
reversed at posttest time); the test items were presented in random
order to avoid order effects. Test items were a mix of short-answer
and multiple-choice items, about half of each (see Figure 3). Two
items, one easier and one harder, assessed each knowledge unit
that the system was preset to cover. The items were carefully
selected, so that the two tests would be fairly equivalent in
difficulty and would be balanced in short-answer items compared to
multiple-choice items. As a result, we have not seen significant
differences in performance between the two tests in any of the
studies reported in this article, for students receiving either test at
pretest time.


Table 2

Sample Messages Given by Learning Companions to Train the Concept of Malleability of
Intelligence

Malleability training
  “Did you know that when we learn something new our brain
  actually changes? We form new connections inside that help us
  solve problems in the future. Isn’t it amazing?”

Perseverance training
  “Struggling in problems is actually a good thing, because it means
  that we are learning something new and making our minds
  grow.”

Demythifying
  “Hey, I found out that people have myths about math, like ‘only
  some people are good at math.’ The truth is that we can all be
  successful in math if we give it a try.”


Figure 3. Two items from the math pretest and posttest instruments to assess mathematics knowledge. The
item on the left is a short-answer item, to assess student knowledge of corresponding angles and internal angles
of a triangle; the multiple-choice item on the right assesses student knowledge of circles and circle radius, x-y
coordinates, and logical statements.
[Left item: “Line DE is parallel to the base AC of the triangle ABC. What is the
measure of angle x?” Right item: “A circle with center (0, 0) and radius 8 will pass through all of the
following points EXCEPT” followed by five answer options.]


Study 1: Large-Scale Pilot (Pilot Study)


Study 1 (referred to as the pilot study) involved 139 high school
students (9th-10th graders) from two schools: an urban,
low-achieving school and a rural, high-achieving school. A single
condition was used in which students were supposed to initiate
help when they needed it. Our purpose in this study was to analyze
the potential of Wayang Outpost to improve mathematics
performance (no learning companions were present) and to determine
behavioral predictors of mathematics learning (in terms of
improvement from pretest to posttest).


Study 2: Design of Help (Help Study)


Study 2 (referred to as the help study) involved 64 students from
a traditionally disadvantaged population in an urban public school
(81% low income) with a large percent of Latino students. We
used a between-subjects design with two experimental conditions:
tutor-initiated help (N = 30) and student-initiated help (N = 34),
with students randomly assigned to either condition. Students in
both conditions were given access to Wayang hints that would lead
them toward the solution (they clicked on a button labeled “Help”).
Only students in the tutor-initiated condition were automatically
offered help when they made mistakes. In the tutor-initiated
condition, students were given the possibility to reject the help when
offered. The main hypothesis was that tutor-initiated help would
probably support students and promote higher learning more than
would standard help, because it would help to regulate help-seeking
behavior. As was the case with the pilot study, no learning
companions were involved.


Study 3: Use of Learning Companions (LC Study) 


Study 3 (referred to as the LC study) involved 233 students from 
a rural-area high-achieving middle school (Grades 7 and 8) and 
used a between-subjects design with two conditions: Wayang with 
learning companions (N = 103) and standard classroom-based 
instruction (N = 100). The hypothesis was that Wayang Outpost 
with learning companions would help students to improve perfor- 
mance outcomes more than would classroom instruction. Students 
were randomly assigned to the gender of the learning companion. 


Study 4: Impact of Learning Companions on Student 
Affect (Affect Study) 


Study 4 (referred to as the affect study) involved 108 students
(9th-10th graders) from two rural-area high schools in Massachusetts
and used a between-subjects design with two conditions:
learning companion (N = 72) and no learning companion (N = 
36). One goal was to see if learning companions helped to improve 
students’ affective states while students worked with Wayang and 
their affective predispositions and attitudes after using Wayang, 
compared to those of students in the no learning companion 
condition. Every 5 minutes, but only after a mathematics problem 
was complete, the tutor asked students to self-report their 
cognitive-affective emotion. Students were asked to report on their
interest, frustration, confidence, and excitement (e.g., “How frustrated
do you feel?”). The responses were chosen from a 6-point
scale. Confidence and interest were measured on bipolar scales,
with “I feel anxious” and “I am bored” at the respective low ends
and “I feel confident” and “Very interested” at the respective high ends.

A correlational validation study with 253 students determined
that those four emotions were highly similar to several
activating and deactivating emotions identified by Pekrun (Arroyo,
Shanabrook, Burleson, & Woolf, 2012), as measured through the
Achievement Emotions Questionnaire for Mathematics (Pekrun et
al., 2007). Students were randomly assigned to the conditions and
to the gender of the learning companion. In addition to taking the 
mathematics pre- and posttests, students filled out an affective
questionnaire prior to and after using Wayang. The questionnaire
assessed their affective predispositions toward mathematics learn- 
ing and math problem solving, and it acted as a baseline for the 
four assessed emotions within the system and after the system at 
posttest time. The post survey also included questions about the 
student’s perceptions of the Wayang Tutor (“Did you learn?” “Did 
you like it?” “Was it helpful?” “Was it friendly?”). 

To obtain process data on students’ emotions during their inter- 
action with Wayang, the system prompted students to report their 
affective state every 5 minutes, with the emotion of choice being 
randomly selected in the question. The affective question appeared 
on the screen only after a mathematics problem was complete
(students were not interrupted while solving a problem) and only
when the student requested a new activity.
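
The prompting rule described above can be sketched in a few lines of code. The following Python fragment is only a minimal illustration under stated assumptions and is not the Wayang implementation: the function and constant names, the use of seconds as the time unit, and the uniform random choice among the four emotions are all hypothetical details introduced for the example.

```python
import random

# The four self-reported emotions named in the text.
EMOTIONS = ("interest", "frustration", "confidence", "excitement")

# "Every 5 minutes" in the text; expressing it in seconds is an assumption.
PROMPT_INTERVAL_SECONDS = 5 * 60


def should_prompt(now, last_prompt_time, problem_complete, requested_new_activity):
    """Return True only when all three conditions from the text hold:
    the interval has elapsed, the current problem is complete, and the
    student has asked for a new activity."""
    interval_elapsed = (now - last_prompt_time) >= PROMPT_INTERVAL_SECONDS
    return interval_elapsed and problem_complete and requested_new_activity


def choose_emotion(rng=random):
    """Pick the emotion to ask about (uniform choice is an assumption)."""
    return rng.choice(EMOTIONS)
```

Under this reading, a student who is still working on a problem when the 5-minute mark passes is simply asked at the next problem boundary, so the interval acts as a minimum spacing between prompts rather than a fixed schedule.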


Results 


Data analyses of these four studies focused on gender differ- 
ences observed before, during, and after students used Wayang. 
This enabled researchers to examine gender differences in sub- 
jects’ affective predispositions and math performance before and 
after the use of the Wayang system and the affective state at each 
time interval, in addition to the problem-solving performance, 
timing, and help activity at every practice problem students saw. 


Gender Differences Before Using Wayang 


Gender differences in prior mathematics achievement, measured
before students used the tutor, were small and nonsignificant
in all studies, with no consistent trend favoring either
gender. Thus, extensive data
from this research suggest that before the students used the
technology, no significant gender difference existed in mathematical
ability.

Affective predispositions toward math problem solving in 
particular were assessed at pretest time through questions of the 
type “How [confident/frustrated] do you feel/get when solving 
math problems?” A gender difference became evident at the
high school level for “How confident do you feel when solving 
math problems?” in the affect study in 2009 and for another 
study carried out in 2009; this is shown in Table 3. A gender 
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Table 3 

Means, Standard Deviations, and Cohen’s d Effect Sizes for Gender Differences, for Affective Predispositions Before Using the Tutoring System 
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SD = 1.21; low-achieving boys M = 3.44, SD = 1.34, F(1, 63) = 5.1,
p = .029. In fact, high-achieving girls reported very similar affective
predispositions to those reported by low-achieving boys (e.g.,
for confidence, high-achieving girls M = 3.61, SD = 1.08),
whereas high-achieving boys reported more positive affective
predispositions than did high-achieving girls (for confidence,
high-achieving boys M = 4.35, SD = 1.39).

On the other hand, the LC study involving 214 middle school
students did not show a gender difference in either confidence
or frustration, even though it was carried out in the same school
as the affect study and the study marked as 2009 in Table 3. In
fact, Table 3 shows a significant advantage for female students
in both interest and excitement, which is not evident in any of
the high school studies. These results suggest an important shift
in gender differences in attitudes toward mathematics problem
solving at the transition from middle to high school: Female
students decrease their interest and excitement toward math
problem solving in relation to their male peers, and they
increase their frustration and decrease their confidence when
solving math problems in relation to their male peers.

Similar gender differences in confidence/anxiety with high
school students were found in studies carried out in Germany
involving over 500 students, in which female teenagers reported
more anxiety, hopelessness, and shame when learning mathematics
and taking tests (Frenzel, Pekrun, & Goetz, 2007). Although it
is unclear whether this affective difference at this age is because of
a gender difference in teenage students’ ability to report and
become aware of these emotional experiences, the results in Table
3 show an important shift from middle to high school. This shift
suggests that educational approaches for mathematics education
must address a deteriorating affective relationship toward math
problem solving, for female students in particular.

It is also important to note that all students reported alarmingly
low levels of interest and excitement toward math problem
solving, with both genders scoring lower than the neutral
level (3.5) in all studies since year 2008. In particular,
low-achieving high school students in the affect study disliked
mathematics more, valued it less, had worse perception of their
mathematics ability, and reported feeling worse when solving
math problems than did high-achieving students (Woolf et al.,
2010). We conclude that it is extremely necessary to make
mathematics more interesting and exciting for all students in
general and that high school girls and low-achieving students
especially need support to address activating negative emotions
such as frustration and anxiety.

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Gender Differences While Using Wayang 


While working within the rich context of tutoring systems, 
students display a variety of measurable behaviors (e.g., they use 
help, solve problems, and request hints). The combination of 
errors, hints, and time spent in a math practice problem enables 
researchers to estimate levels of mastery and engagement and to 
distinguish between low mastery and lack of motivation (e.g., 
answering quickly and incorrectly without requesting a hint is 
generally an indication of quick-guessing and disengagement, 
rather than low mastery of a skill). By combining three basic 
behaviors (help seeking, disengaged behavior, and affect), re- 
searchers can better understand more complex learning behaviors 
(Arroyo, Mehranian, & Woolf, 2010). 

Help seeking and responses. Tutoring systems typically pro- 
vide various forms of help and often immediately provide students 
with help as soon as they enter incorrect answers, in part because 
many students do not exhibit effective help-seeking behavior and 
so need scaffolding (Aleven, McLaren, Roll, & Koedinger, 2004; 
Aleven, Stahl, Schworm, Fischer, & Wallace, 2003). Based on 
prior research, Wayang Outpost encourages students to read prob- 
lems, recognize if they can solve the problem, ask for help, spend 
time to understand the hint, and try to solve the problem. However, 
students do not necessarily adhere to this ideal model and tend to 
avoid help or abuse help (Aleven et al., 2004; Baker et al., 2004). 
In fact, the ways in which help seeking occurs are fairly complex; 
Aleven, McLaren, Roll, and Koedinger (2006) identified 57 pro- 
duction rules that capture both effective and ineffective help- 
seeking behavior. Baker et al. (2004) identified a variety of sub- 
optimal help seeking and other behaviors that were considered 
ways of gaming the system. 

Adopting the gaming the system approach, we defined two 
specific behaviors. Help abuse implies that students ask for all 
hints quickly, so that the answer is revealed. Quick-guessing 
entails clicking fast through several attempts, either by spending 
only a few seconds between incorrect attempts or by making a first 
attempt within seconds after a problem is displayed. 
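
The two definitions above can be expressed as simple rules over a per-problem log. The sketch below is illustrative only and is not the detector used in the studies: the `ProblemLog` record, its field names, and the 4-second cutoff are hypothetical, because the text specifies only “quickly” and “a few seconds.” It also assumes that quick-guessing through retries requires every gap between incorrect attempts to be short.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cutoff; the text does not give a numeric threshold.
QUICK_SECONDS = 4.0


@dataclass
class ProblemLog:
    """Timing record for one student on one problem (hypothetical schema)."""
    seconds_to_first_attempt: float
    seconds_between_incorrect_attempts: List[float]
    seconds_on_each_hint: List[float]
    hints_available: int


def is_help_abuse(log: ProblemLog) -> bool:
    """All available hints were requested, each after only a brief pause."""
    saw_all_hints = (
        log.hints_available > 0
        and len(log.seconds_on_each_hint) >= log.hints_available
    )
    rushed = all(t < QUICK_SECONDS for t in log.seconds_on_each_hint)
    return saw_all_hints and rushed


def is_quick_guess(log: ProblemLog) -> bool:
    """A very fast first attempt, or rapid-fire incorrect retries."""
    fast_first_attempt = log.seconds_to_first_attempt < QUICK_SECONDS
    gaps = log.seconds_between_incorrect_attempts
    fast_retries = len(gaps) > 0 and all(g < QUICK_SECONDS for g in gaps)
    return fast_first_attempt or fast_retries
```

Keeping the two rules separate mirrors the text, which treats help abuse and quick-guessing as distinct behaviors; a single problem can in principle trigger either, both, or neither.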

One research issue was to identify how students’ help-seeking 
behavior relates to their learning. As a part of Study 1 (pilot study), 
where help was available but was student initiated, we conducted 
a correlational analysis between students’ help-seeking behavior 
and learning. This study did not include learning companions. 
Help-seeking behavior and particularly spending time on problems 
where help was seen (e.g., spending time thinking about and 
processing help and seeking deeply for more hints when neces- 
sary) were important predictors of student pre- to posttest improve- 
ment (Arroyo et al., 2004). Important gender differences were 
found in this study, including female students working more 
slowly on problems (seconds per problem) and spending more 
time between incorrect attempts. On average, female students 
invested significantly more time on hints, specifically in the total 
minutes on problems where help was seen (for female students, 
M = 94.92 min, SD = 61.21; for male students, M = 76.28 min, 
SD = 62.05, p < .05), although the genders saw similar amounts 
of hints per problem (for female students, M = 0.86 hints per
problem, SD = 0.81; for male students, M = 0.90 hints per
problem, SD = 0.95) and completed similar numbers of problems.
This is important, because seeking help was found to be a behavior
conducive to learning (Aleven et al., 2003; Arroyo et al., 2004). It
suggested that girls were using the system more productively, even
though there was only a trend for girls to have higher learning in 
that study (pretest to posttest improvement). 
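
To put the time-on-help difference on the same scale as the Cohen’s d values reported elsewhere in this article, the means and standard deviations above can be converted to a standardized difference. The following is a rough sketch rather than the authors’ computation: the group sizes by gender are not given in the text, so the pooled standard deviation assumes equal group sizes, and the function name is chosen for this example.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using a pooled SD that assumes equal group sizes."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Minutes on problems where help was seen (Study 1 values from the text):
# female students M = 94.92, SD = 61.21; male students M = 76.28, SD = 62.05.
d_time = cohens_d(94.92, 61.21, 76.28, 62.05)
print(round(d_time, 2))
```

With these values the standardized difference is roughly 0.30. The article does not report an effect size for this comparison, and a pooled SD weighted by the actual group sizes could differ slightly.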

In addition, girls indicated having more positive goals at posttest 
time and reported “seriously trying to learn” more frequently, 
whereas boys reported “trying to get it over with” more often. 
Students’ motivational goals were in turn correlated with produc- 
tive behaviors within the system (e.g., “seeking deeply for help” 
was significantly correlated with pretest to posttest improvement). 
Student reports of “seriously trying to learn” resemble
measures of learning/mastery orientation (Dweck,
1999). Girls also perceived the software as more helpful than did
boys and more often reported that they would use it again. This 
initial study suggested that female students made more productive 
use of the Wayang tutoring system, taking greater advantage of its 
problem-solving support features than did male students and re- 
porting more productive attitudes toward learning. 

The subsequent experimental Study 2 (help study) explored the 
impact of having Wayang explicitly offer hints (instead of expect- 
ing students to request them), given that it was possible that many 
students were not making sufficient use of the available help and 
scaffolding. Although students in general improved more from 
math pretest to math posttest in the tutor-initiated help condition, 
gender differences were observed in students’ acceptance or re- 
jection of that help. When help was explicitly offered and tutor 
initiated, girls tended to accept the hints offered but boys tended to 
refuse them, with the final effect that female students saw more 
hints in total (Cohen’s d = 0.59, p < .05). 

In Study 4 (affect study), we analyzed the impact of learning 
companions for male and female students. We found that the 
presence of learning companions influenced productive help- 
seeking behaviors differentially for male and female students. 
Significant gender differences were observed in what we call 
“productive behaviors,” which refers to the amount of time spent 
on help (Arroyo & Woolf, 2005). In this case, we compared 
students who received learning companions to those who did not. 
A significant interaction between gender and condition on “time in 
helped problems” suggests that, when the female learning com- 
panions were present, female students spent more time on hints 
(more time in problems where help was seen; Cohen’s d = 0.54,
p < .05). Apparently, female students were searching more deeply
for help or thinking more about help than were their male coun- 
terparts. 

Disengagement behavior. Tutoring systems are designed for 
an ideal student who behaves in a highly motivated way and tries 
to learn. However, metacognitive bugs do occur while students are 
using tutors (Aleven et al., 2004); also, students can fail to read the 
problem, not seek help, rush through, and in general game the 
system (Baker et al., 2004). Some disengaged behavior is directly 
related to help-seeking activities (e.g., help abuse, which involves 
rushing through hints until the correct answer is revealed). 

In Study 1 (pilot study), girls spent more time on problems and 
less time quick-guessing than did boys (i.e., girls spent more time 
between incorrect attempts for a given problem). Moreover, boys 
saw more problems and girls tended to see fewer problems but 
spent more time on each. In addition, male students more frequently
“abused help” by using the help to see the final answer (Cohen’s 
d = 0.83, p < .01). Thus, boys likely rushed through the content 
more than did girls, suggesting they were more frequently disen- 
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gaged. In addition, in the same student-initiated-help condition in 
Study 2 (help study) that matched the condition of the pilot study, 
male students quick-guessed more often (Cohen’s d = 0.42, p <
.01).

In Study 3 (LC study), where all students were assigned to a 
random character, male students abused help more often than did 
female students (Cohen’s d = 0.59, p < .05). 

Last, in Study 4 (affect study), gender differences were also
evident in disengagement behavior connected with having versus
not having learning companions (Condition × Gender interaction
effects). These differences were found for actions such as
help abuse, quick-guessing, and skipped problems; they indicated
that girls were more engaged when companions were present. In
particular, when the female learning companion was present, female
students quick-guessed almost two standard
deviations less than male students (Cohen’s d = 1.80, p < .005).
On the other hand, male students benefited from the male
character in particular with regard to disengagement:
Table 4, row 10, shows that male students receiving the male
character quick-guessed 1.55 standard deviations less than when
receiving no character at all, thus improving their behavior.

Gender differences in affect. We observed gender differ- 
ences in students’ self-reported affective experiences while they 
used Wayang Outpost in Study 4 (affect study) and Study 3 (LC 
study). We carried out analyses of variance for each of the affec- 
tive and behavioral dependent variables (post tutor and within 
tutor) shown in Table 4 for data from Study 4 (affect study). 
Covariates consisted of the corresponding pretest baseline variable 
(e.g., when analyzing confidence toward problem solving inside 
the tutoring system or at posttest time, we accounted for the pretest 
baseline confidence). Independent variables corresponded to con- 
dition (e.g., either learning companion [LC present/absent] or 
group [female companion/male companion/no companion]). We 
analyzed both main effects and interaction effects for student 
gender and condition over all student data. 

Students in general reported more interest (less boredom) when 
learning companions were present than when they were absent (see 


Table 4 


Table 4, rows 1 and 2). However, all of the results for other 
outcomes were gender dependent. For instance, female students
reported significantly lower frustration when working with the
female character, but this did not happen for male students or for
female students who worked with the male character (see rows 3
and 4 in Table 4).

We also analyzed gender differences for these affective outcomes
(note that gender differences within a single condition are not
shown in Table 4). In the no-learning-companion condition, male
students reported more excitement than did female students
(Cohen’s d = 0.68, p < .05), and there was a trend for
male students to report higher levels of interest than female
students in the male-character condition (Cohen’s d = 0.47, p < .1).
Table 4 does show a specific benefit of the female character (Jane)
for female students on reports of excitement, compared to the
control condition (see row 7 in Table 4).

These results suggest that, when the goal is to reduce students’
frustration or increase excitement and interest, female students
should receive the female learning companion, whereas male students
should receive the male character or no character at all and should
not receive the female character.


Gender Differences After Using Wayang 


Gender differences in learning. Our studies indicate that,
after use of Wayang, the presence of affective characters had
moderate effects on learning gains, depending on gender and
condition. Female students learned more than male students in
Study 3 (LC study) when the female (Jane) character was present, 
and female students improved less than male students from pretest 
to posttest when characters were absent in the affect study. 

When characters were absent, as in the help study and the
pilot study, girls in general did not learn significantly more than
boys with Wayang Outpost. However, girls tended to have higher
mean learning gains and to display more productive behaviors, which
have been shown to be significant predictors of learning (Arroyo &


Study 4: Effect Sizes for Posttest and Within-Tutor Emotion Self-Reports

                                        LC vs. control                Female character vs. control  Male character vs. control
                                        All       Female    Male      All       Female    Male      All       Female    Male
Self-report                             subjects  subjects  subjects  subjects  subjects  subjects  subjects  subjects  subjects
1. Interest within-tutor                0.15      0.03      0.10      0.16      0.06      [?]       0.16      0.01      0.32
2. Interest post-tutor                  0.29**    0.44      0.18      0.14      0.37      0.18      0.41      0.65*     0.25
3. Frustration within-tutor             -0.26     -0.68**   0.11      -0.46     -0.99***  0.00      [?]       -0.60     0.21
4. Frustration post-tutor               -0.30     -0.48**   -0.16     [?]       -0.57*    -0.16     -0.20     -0.34     -0.14
5. Confidence within-tutor              -0.04     0.06      -0.10     -0.08     0.06      [?]       0.01      0.06      -0.02
6. Confidence post-tutor                0.13      0.26      0.04      0.16      0.41      0.04      0.10      0.09      0.14
7. Excitement within-tutor              0.11      0.58      -0.16     0.23      0.76**    [?]       -0.01     0.34      -0.16
8. Excitement post-tutor                0.06      0.35      [?]       -0.05     0.38      [?]       0.14      0.51      -0.08
9. Productive behavior: Time on hints   0.18      0.36      -0.02     0.26      0.53      [?]       0.05      -0.26     0.39
10. Disengagement: Quick-guessing       -0.02     -0.59**   -0.50     -0.10     -0.41     0.58**    -0.07     0.49      -1.55

Note. This table shows effect sizes (Cohen’s d) for having/not having a learning companion in general (columns 2-4), and specifically for students who
got the female affective character (columns 5-7) or for students who got the male affective character (columns 8-10). In all cases, the control condition
corresponds to students receiving no characters. A positive number indicates that the mean was higher for the experimental condition; a negative one
indicates a higher mean for the control without affective characters. Significance level corresponds to a main effect for condition, the Fisher test of
between-subjects for the corresponding analysis of variance, with pretest baseline as a covariate. Numbers in bold type indicate significant values. LC =
learning companion and includes the female and male characters.

* Near significant at p < .1. ** Significant at p < .05. *** Significant at p < .005.
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Woolf, 2005). For instance, in the help study, all students learned 
more in the tutor-initiated help condition, F(1, 63) = 6.38, p = .015.
However, female students had a tendency to improve more in that 
condition (female improvement from pretest to posttest, M = 0.24
[i.e., 24%], SD = 0.21; male improvement, M = 0.12, SD = 0.24,
Cohen’s d = 0.57). This is presumably because female students 
accepted more hints than did male students, thus receiving more 
support, and this translated into a trend for higher mean improve- 
ment. 

Significantly higher learning was observed in Study 3 (LC
study) when the gender of the companion matched the gender of
the student. Further analyses examined the influence of a
gender-matched variable that was true when the gender of the
student and the character were the same. The result was a significant
effect of matched gender in predicting learning gains, F(1, 88) =
3.5, p = .048, which indicates that students improve their math
problem-solving performance more with a character that matches
their own gender.
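
The gender-matched variable described above is a derived indicator that can be computed directly from each student’s record. The sketch below shows one way to build it and to summarize pretest-to-posttest gains by match status. It is an illustration under stated assumptions: the record layout and function names are hypothetical, and the four records are invented placeholder values, not data from the LC study.

```python
from statistics import mean


def gender_matched(student_gender: str, character_gender: str) -> bool:
    """True when the student and the learning companion share a gender."""
    return student_gender == character_gender


def mean_gain_by_match(records):
    """Average posttest-minus-pretest gain for matched and unmatched students.

    Each record is (student_gender, character_gender, pretest, posttest).
    """
    gains = {True: [], False: []}
    for student_gender, character_gender, pretest, posttest in records:
        matched = gender_matched(student_gender, character_gender)
        gains[matched].append(posttest - pretest)
    return {matched: mean(values) for matched, values in gains.items() if values}


# Invented placeholder records for illustration only (not study data).
toy_records = [
    ("F", "F", 0.50, 0.75),
    ("F", "M", 0.50, 0.50),
    ("M", "M", 0.25, 0.50),
    ("M", "F", 0.25, 0.25),
]
summary = mean_gain_by_match(toy_records)
```

In an analysis of the kind reported here, the indicator would enter an analysis of variance as a between-subjects factor; the group means above are only the descriptive first step.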

Further analyses of Study 2 (help study) revealed a general trend
for students to learn more with the experimental version of Wayang
that offered hints when incorrect attempts were made. Student
learning gains revealed a main effect for condition, F(1, 50) = 6.4,
p = .015. In addition, a significant interaction effect for Condition ×
Gender × Incoming Mathematics Ability emerged, F(1, 50) = 5.1,
p = .029. This suggests that the control condition, in which help was
not offered automatically, was especially detrimental for female students
of low math ability but was still beneficial for high-ability female
students. Neither condition of the help study was better for male
students’ learning at that point.

Gender differences in perceptions and motivation. Table 5 
shows differences in student affect after using the Wayang tutor 
for Study 4 (affect study) in particular. We can observe significant 
gender differences that indicate that female students receiving the 
Jane character perceived the system significantly better than did 
those without a character (see Table 5, columns 4-6). These
differences suggest that the no character condition generated better 
perceptions of the software for male subjects than for female 
subjects (column 3, row 1). It appears that female students per- 
ceived the tutor more positively when learning companions were 
present but that the opposite was true for male students, who 
clearly preferred the absence of the learning companion instead of 


Table 5 


the presence of the female character. Interestingly, female students 
also perceived the system better when receiving the male character 
Jake than when receiving no character at all (column 8, row 1). 

A general trend for the Jane character in particular suggests that 
students in general increased their self-concept of their ability to 
do mathematics and mathematics liking when receiving this char- 
acter (see Table 5, column 4, rows 2 and 3). However, a more 
detailed analysis separating by gender (columns 5 and 6) suggests 
that this difference is due to a significant benefit for female 
students but not for male students. 


Discussion and Conclusions 


Although students of each gender had similar incoming math- 
ematics ability, high school girls consistently reported lower con- 
fidence and higher frustration and anxiety toward mathematics at 
pretest time. On the other hand, middle school girls reported more 
interest and excitement toward math than did middle school boys, 
and they did not differ in confidence, frustration, or anxiety mea- 
surements. In contrast to these incoming factors, a variety of 
significant indicators suggest that female students particularly ben- 
efited from the Wayang Outpost tutor, both affectively and cog- 
nitively, and particularly when a female affective learning charac- 
ter was present. The results suggest a general advantage of 
affective learning companions for several affective outcomes for 
female students: reduced frustration; increased excitement; and 
increased perception of the software, self-efficacy in mathematics, 
and liking of mathematics. 

In general, students reported significantly more interest (less 
boredom) when learning companions were used than when no 
learning companions were used. At the same time, students who 
received the female learning companion reported significantly 
higher self-concept and liking of mathematics at posttest time, 
although it seems that the difference is mostly due to a benefit for 
female students. Students who received the female learning com- 
panion also reported higher confidence toward problem solving in 
post-tutor surveys. Posttest excitement among female students was 
higher for those who worked with companions than for those who 
had no companion; in contrast, excitement among male students was
higher when companions were absent, and male students quick-guessed
less when characters were absent.


Study 4 (Affect): Post-Tutor Differences in Students’ Perception of the Learning Experience Between Experimental and Control Conditions

                                   LC vs. control                Female character vs. control  Male character vs. control
                                   All       Female    Male      All       Female    Male      All       Female    Male
Posttest perception                subjects  subjects  subjects  subjects  subjects  subjects  subjects  subjects  subjects
1. Perception of software          0.03      0.82**    -0.65*    0.09      0.89**    -0.70*    0.23      0.78**    [?]
2. Self-efficacy in mathematics    0.25      0.37      0.14**    0.50*     0.30**    0.16      [?]       -0.02     -0.14
3. Liking of mathematics           0.06      0.38      -0.14     0.49*     0.34**    0.35      0.04      0.07      0.02

Note. Effect sizes (Cohen’s d) are reported for the learning companion condition, either the female or the male character (columns 2-4); for the female
character condition (columns 5-7); and for the male character condition (columns 8-10). Within each of these, experimental vs. control conditions are
considered for all subjects first, then for female subjects alone, and then for male subjects alone. Significance level corresponds to the between-subjects
Fisher test for the corresponding analysis of variance. A positive number indicates that the mean was higher for the experimental condition; a negative
number indicates a higher mean for the control without affective characters. Numbers in bold type indicate significant values. LC = learning companion
and includes the female and male characters.

* Marginally significant at p < .1. ** Significant at p < .05. *** Significant at p < [?].
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The Wayang tutor provided individualized and adaptive mech- 
anisms for problem selection along with animated companions. 
One pedagogical approach was to guarantee student success by 
adjusting the difficulty of selected problems before moving on to 
harder problems, as specified in Arroyo, Mehranian, and Woolf 
(2010). In the studies reported here, the learning companions
in particular appear to have an important impact on female
students, with marginal benefits of the male character for male
students. This advantage of same-sex matching can be explained in
two ways: Either students developed self-efficacy via role modeling,
or messages came through because of higher intimacy due to
age-related same-sex friendships. Additionally, girls perceived the
entire learning experience with Wayang significantly better than 
did boys, in particular when learning companions were present, 
whereas the opposite was true for boys, who reported better 
perception of learning when the companions were absent. Gender 
differences were also observed on posttests after students worked 
with the tutor, specifically in that girls reported higher confidence 
and lower frustration than did boys, in all conditions. Modeling 
and responding to student gender within intelligent tutors is par- 
ticularly powerful, as it can improve teaching and personalize 
instruction at a very low cost. 

We observed behavioral gender differences while students in- 
teracted with the tutor, even without digital companions, suggest- 
ing that girls accepted more help (and thus tended to learn more), 
had more productive behaviors conducive to learning (e.g., spent 
more time with help aids), and showed reduced disengagement 
(e.g., boys engaged in gaming more frequently, girls made fewer 
quick-guesses). However, girls engaged in more frequent quick- 
guessing when the male character was assigned to them. 

Several reasons may be suggested for the less than optimal 
behavior of male students with the tutor. It is possible that male 
students avoid requesting or accepting help and hints because they 
are protective of their self-efficacy and they blame the tutor 
(external attributions) for their mistakes, thus adopting an avoid- 
ance strategy toward the software (Wigfield, 1988). This produces 
a suboptimal self-regulation strategy, as a more effective strategy 
would be to request hints on problems that they cannot solve. 

Other research with educational technology supports the differ- 
ential impact of technology on the two genders. For instance, 
Burleson and Picard (2007) found that interactive activities involving
pedagogical agents reduced frustration more for female students
than for male students. In an Australian
study, first-year college students benefited from the use of online 
feedback, but male students chose not to complete the feedback 
session as often as did female students (Sanders et al., 2007). 
When the feedback session was shortened, male students’ involve- 
ment increased and subjects who engaged with the feedback did 
improve their test scores. In related work (Gunn, French, McLeod,
McSporran, & Conole, 2002), male subjects studying computer 
science were not as self-aware of their need for formative assess- 
ment as were their female counterparts. 

This research highlights how to best support female students in 
intelligent learning environments, but it leaves open questions 
about how to support male students and the reasons for these 
differences. The beneficial result of characters for female students 
can be explained with social cognitive theory, which suggests that 
self-efficacy and self-regulation are related. If students have a
higher self-concept while learning, they should tend to self-regulate
better and have more productive behaviors and perceptions
of the software and themselves as learners. Animated companions
might have acted as role models to support self-efficacy
by modeling, as suggested by Bandura and Bussey (2004). Lastly, 
by talking about myths in mathematics and reflecting about the 
meaning of making errors and the importance of perseverance, 
gendered companions apparently encouraged students to improve 
confidence in their abilities, thus raising their belief that they had 
what it takes to succeed. 

Research such as that described here may ultimately lead to nuanced
recommendations about the type of individual support to provide
for each gender and level of mathematics ability. Perhaps female
students should work with female learning companions and male 
students should receive a male learning companion. Perhaps high- 
achieving male students should receive no learning companion at 
all. These recommendations cannot be made from a single experiment,
but persistent results over 10 years, such as those provided in this
article, begin to provide a persuasive argument about the need for 
differential support for male and female students. 

Although these results suggest some accommodations, we 
should be careful about making sweeping conclusions (e.g., that 
male students should never receive any kind of learning compan- 
ions). In fact, there is evidence that low-achieving students (both 
male and female) benefited from affective learning companions 
(Woolf et al., 2010). These findings suggest that high-achieving 
male students did not benefit from learning companions in general 
and, in particular, that the female character was detrimental to 
several outcomes and behaviors. Further studies are needed about 
gender differences as students interact with advanced technology 
in other contexts and domains to validate the results provided here 
and to suggest specific strategies that work for male students in 
general. 

A Meta-Analysis of the Effectiveness of Intelligent Tutoring Systems on
K-12 Students’ Mathematical Learning


Saiying Steenbergen-Hu and Harris Cooper 
Duke University 


In this study, we meta-analyzed empirical research of the effectiveness of intelligent tutoring systems 
(ITS) on K-12 students’ mathematical learning. A total of 26 reports containing 34 independent samples 
met study inclusion criteria. The reports appeared between 1997 and 2010. The majority of included 
studies compared the effectiveness of ITS with that of regular classroom instruction. A few studies 
compared ITS with human tutoring or homework practices. Among the major findings are (a) overall, 
ITS had no negative and perhaps a small positive effect on K-12 students’ mathematical learning, as 
indicated by the average effect sizes ranging from g = 0.01 to g = 0.09, and (b) on the basis of the few 
studies that compared ITS with homework or human tutoring, the effectiveness of ITS appeared to be 
small to modest. Moderator analyses revealed 2 findings of practical importance. First, the effects of ITS 
appeared to be greater when the interventions lasted for less than a school year than when they lasted for 
1 school year or longer. Second, the effectiveness of ITS for helping students drawn from the general 
population was greater than for helping low achievers. This finding draws attention to the issue of
whether computerized learning might contribute to the achievement gap between students with different 
achievement levels and aptitudes. 


Keywords: intelligent tutoring systems, effectiveness, mathematical learning, meta-analysis, achievement 


Intelligent tutoring systems (ITS) are computer-assisted learning 
environments created using computational models developed in 
the learning sciences, cognitive sciences, mathematics, computa- 
tional linguistics, artificial intelligence, and other relevant fields. 
ITS often are self-paced, learner-led, highly adaptive, and interac- 
tive learning environments operated through computers. ITS are 
adaptive in that they adjust and respond to learners with tasks or 
steps to suit learners’ individual characteristics, needs, or pace of 
learning (Shute & Zapata-Rivera, 2007). 

ITS have been developed for mathematically grounded aca- 
demic subjects, such as basic mathematics, algebra, geometry, and 
statistics (Cognitive Tutor: Anderson, Corbett, Koedinger, & Pel- 
letier, 1995; Koedinger, Anderson, Hadley, & Mark, 1997; Ritter, 
Kulikowich, Lei, McGuire, & Morgan, 2007; AnimalWatch: Beal, 
Arroyo, Cohen, & Woolf, 2010; ALEKS: Doignon & Falmagne, 
1999); physics (Andes, Atlas, and Why/Atlas: VanLehn et al., 
2002, 2007); and computer science (dialogue-based intelligent 
tutoring systems: Lane & VanLehn, 2005; ACT Programming 
Tutor: Corbett, 2001). Some ITS assist with the learning of reading 
(READ 180: Haslam, White, & Klinge, 2006; iSTART: McNa- 
mara, Levinstein, & Boonthum, 2004), writing (R-WISE writing 
tutor: Rowley, Carlson, & Miller, 1998), economics (Smithtown: 
Shute & Glaser, 1990), and research methods (Research Methods 
Tutor: Arnott, Hastings, & Allbritton, 2008). There are also ITS
for specific skills, such as metacognitive skills (see Aleven, 
McLaren, & Koedinger, 2006; Conati & VanLehn, 2000). The use 
of ITS as an educational tool has increased considerably in recent 
years in U.S. schools. Cognitive Tutor by Carnegie Learning, for 
example, was used in over 2,600 schools in the United States as of 
2010 (What Works Clearinghouse, 2010a). 

ITS are developed so as to follow the practices of human tutors 
(Graesser, Conley, & Olney, 2011; Woolf, 2009). They are ex- 
pected to help students of a range of abilities, interests, and 
backgrounds. Research suggests that expert human tutors can help 
students achieve learning gains as large as two sigmas (Bloom, 
1984). Although the effect was smaller than what Bloom (1984) reported, a
recent meta-review by VanLehn (2011) found that human tutoring
had a positive impact of d = 0.79 on students’ learning.

ITS track students’ subject domain knowledge, learning skills, 
learning strategies, emotions, or motivation in a process called 
student modeling at a level of fine-grained detail that human tutors 
cannot (Graesser et al., 2011). ITS can also be distinguished from 
computer-based training, computer-assisted instruction (CAI), and 
e-learning. Specifically, given their enhanced adaptability and 
power of computerized learning environments, ITS are considered 
superior to computer-based training and CAI in that ITS allow an 
infinite number of possible interactions between the systems and 
the learners (Graesser et al., 2011). VanLehn (2006) described ITS 
as tutoring systems that have both an outer loop and an inner loop. 
The outer loop selects learning tasks; it may do so in an adaptive 
manner (i.e., select different problem sequences for different stu- 
dents), on the basis of the system’s assessment of each individual 
student’s strengths and weaknesses with respect to the targeted 
learning objectives. The inner loop elicits steps within each task 
(e.g., problem-solving steps) and provides guidance with respect to 
these steps, typically in the form of feedback, hints, and error
messages. In this regard, as VanLehn (2006) noted, ITS are different
from CAI, computer-based training, or web-based homework
in that the latter lack an inner loop. ITS are one type of
e-learning that can be self-paced or instructor directed, encompass- 
ing all forms of teaching and learning that are electronically 
supported, through the Internet or not, in the form of texts, images, 
animations, audios, or videos. 

The growth of ITS and the accumulation of evaluation research 
justify a meta-analysis of the effectiveness of ITS on students’
mathematical learning for the following three reasons. First, sev- 
eral reviews of the impact of ITS on reading already exist (Becker, 
1992; Blok, Oostedam, Otter, & Overmaat, 2002; Kulik, 2003). 
Most recently, Cheung and Slavin (2012) reviewed the effects of 
educational technology on K-12 students’ reading achievement,
relative to traditional instructional methods. They found an aver- 
age standardized mean difference of 0.16 favoring the educational 
technology. No similar review regarding ITS with a focus on math 
has been carried out. 

Second, much research on the effectiveness of math ITS has 
accumulated over the last two decades. Without rigorous summa- 
rization, this literature appears confusing in its findings. For ex- 
ample, Koedinger et al. (1997) found that students tutored by 
Cognitive Tutor showed extremely high learning gains in algebra 
compared with students who learned algebra through regular class- 
room instruction. Shneyderman (2001) found that, on average, 
students who learned algebra through Cognitive Tutor scored 0.22 
standard deviations above their peers who learned algebra in 
traditional classrooms but only scored 0.02 standard deviations 
better than their comparison peers on the statewide standardized 
test. However, Campuzano, Dynarski, Agodini, and Rall (2009) 
found that sixth grade students who were taught math with regular 
classroom instruction throughout a school year outperformed those 
who were in regular class 60% of the school year and spent the 
other 40% of class time learning math with ITS, indicated by an 
effect size of -0.15. Thus, there is a need to gather, summarize,
and integrate the empirical research on math ITS, to quantify their 
effectiveness and to search for influences on their impact. 

Third, there has been increased attention in recent years on the 
effectiveness of math ITS for students’ learning. The What Works 
Clearinghouse (WWC) has completed several evidence reviews on 
some math ITS products. For example, the WWC produced four 
reviews on Carnegie Learning’s Cognitive Tutor (i.e., the WWC, 
2004, 2007, 2009, 2010a). The WWC also reviewed the evidence 
on Plato Achieve Now (see WWC, 2010b). The WWC reviews, 
however, did not include all math ITS. Our literature search 
identified more than a dozen intelligent tutoring system products 
designed to help students’ mathematical learning. And, important 
for our effort, the WWC reviews did not examine factors that 
might influence the direction and magnitude of the ITS effect. In 
contrast, our effort does not focus on specific intelligent tutoring 
system programs but on their general effectiveness and on the 
factors that moderate their impact. 

In sum then, a number of questions regarding the effectiveness 
of ITS can be addressed by a meta-analysis. Most broadly, it can 
estimate the overall average effectiveness of ITS relative to other 
types of instruction on students’ mathematical learning. But more 
specific questions also may be answerable. For example, a meta-analysis
can explore what kind of settings ITS work best in, as well
as for what types of student populations. By using information
across as well as within primary studies, a meta-analysis provides 
a useful quantitative strategy for answering these questions. 


Method 


Study Inclusion and Exclusion Criteria 


For studies to be included in this meta-analysis, the following 
eight criteria had to be met: 


1. Studies had to be empirical investigations of the effects 
of ITS on learning of mathematical subjects. Secondary 
data analyses and literature reviews were excluded. 


2. Studies had to be published or reported during the period 
from January 1, 1990, to June 30, 2011, and had to be 
available in English. 


3. Studies had to focus on students in grades K-12, including
high achievers, low achievers, and remedial students.
However, studies focusing exclusively on students with 
learning disabilities or social or emotional disorders (e.g., 
students with attention-deficit/hyperactivity disorder) 
were excluded. 


4. Studies had to measure the effectiveness of ITS on at 
least one learning outcome. Common measurements in- 
cluded standardized test scores, modified standardized 
test scores, course grades, or scores on tests developed by 
researchers. 


5. Studies had to have used an independent comparison 
group. Comparison conditions included regular class- 
room instruction, human tutoring, or homework. Studies 
without a comparison group or those with one-group 
pretest-posttest designs were excluded.


6. Studies had to use randomized experimental or quasi- 
experimental designs. If a quasi-experimental design was 
used, evidence had to be provided that the treatment and 
comparison groups were equivalent at baseline (see 
WWC, 2008). Studies with a significant difference be- 
tween the treatment and comparison groups prior to the 
ITS intervention were excluded, unless information was 
available for us to calculate effect sizes that would take 
into account the prior difference. 


7. Studies had to have at least eight subjects in treatment 
and comparison groups, respectively. Studies with sam- 
ple sizes less than eight in either group were excluded. 


8. Studies had to provide the necessary quantitative infor- 
mation for the calculation or estimation of effect sizes. 


Study Search 


We used the following procedures to locate studies: (a) a search 
of abstracts in electronic databases including ERIC, PsycINFO, 
Proquest Dissertation and Theses, Academic Search Premier, 
Econlit With Full Text, PsycARTICLES, SocINDEX With Full
Text, and Science Reference Center; (b) Web searches using the 
Google and Google Scholar search engines; (c) a manual exami- 
nation of reference and bibliography lists of the relevant studies; 
and (d) personal communications with 18 ITS research experts 
who had been the first author on two or more ITS studies during 
the past 20 years. 

We used a wide variety of search terms to ensure our searches 
would identify as many relevant studies as possible. Although 
some researchers have used the term intelligent tutoring systems, 
many others have used a wide variety of alternative terms, for 
example, computer-assisted tutoring, computer-based tutoring, ar- 
tificial tutoring, or intelligent learning environments. Therefore, 
we also used the terms intelligent tutor*, artificial tutor*, computer
tutor*, computer-assisted tutor*, computer-based tutor*, intelligent
learning environment*, computer coach*, online-tutor*, keyboard
tutor*, e-tutor*, electronical tutor*, and web-based tutor*. After
concluding these searches, we began to focus on math ITS. 

We found that some math ITS studies could not be retrieved 
through the search keywords above and some ITS studies are 
locatable only through the use of particular ITS names. The ref- 
erence list of Graesser et al.’s (2011) introduction to ITS, for 
instance, indicates that large numbers of studies are exclusively 
connected with particular ITS programs, such as Cognitive Tutor, 
AutoTutor, or CATO. Dynarski et al. (2007) evaluated the effec- 
tiveness of three mathematical educational software programs for 
sixth graders (i.e., Larson Pre-Algebra, Achieve Now, and iLearn 
Math) and three software programs for ninth graders (i.e., Cogni- 
tive Tutor Algebra, Plato Algebra, and Larson Algebra). All of 
these educational software programs were actually ITS products. 
However, we found our previous search only caught studies of 
Cognitive Tutor, the most widely used and studied ITS, and we 
missed all studies of the other software. This was also the case for 
the educational software evaluated by Campuzano et al. (2009). 
Therefore, we used the names of some major software programs 
reported in Graesser et al. (2011), Dynarski et al. (2007), and 
Campuzano et al. (2009) and conducted a third search in ERIC and 
PsycINFO. No new qualified studies were found. However, by 
screening all of the studies in the WWC reviews of Cognitive 
Tutor and Plato Achieve Now (i.e., WWC, 2004, 2007, 2009, 
2010a, 2010b), we found five additional studies that qualified for 
inclusion. In summary, our search concluded with 26 qualified 
reports evaluating the effectiveness of ITS on K-12 students’ 
mathematical learning. 


Study Coding 


We designed a detailed coding protocol to guide the study 
coding and information retrieval. The coding protocol covered 
studies’ major characteristics, which included (a) the basic features 
of the study reports (e.g., whether the study was published or 
unpublished and when it was conducted), (b) research design 
features (e.g., sample sizes; whether the study used a randomized 
or quasi-experimental design), (c) the contexts of intervention 
(e.g., subject matter; whether the study compared ITS with regular 
classroom instruction, human tutors, or other education interven- 
tions; the duration of ITS intervention), and (d) the study outcomes 
(e.g., what and how outcomes were measured; when the assess- 
ments took place; the magnitudes and direction of the effect sizes). 


Two coders independently coded the major features of each study, 
except the study outcomes, and then met together to check the 
accuracy of the coding. If there was a disagreement in coding, the 
two coders discussed and reexamined the studies to settle on 
the most appropriate coding. If the disagreement could not be 
resolved, the second author was consulted. The first author coded 
the study outcomes and then discussed the codes with the second 
author. The major specific variables coded are described later 
along with the study results. 


Effect Size Calculation 


We used Hedges’ g, a standardized mean difference between two
groups, as the effect size index for this meta-analysis. The preference
for Hedges’ g over other standardized-difference indices, such
as Cohen’s d and Glass’s Δ, is due to the fact that Hedges’ g can be
corrected to reduce the bias that may arise when the sample size is
small (i.e., n < 40; Glass, McGaw, & Smith, 1981). Hedges’ g was
chosen for this meta-analysis because the samples in many ITS 
studies are small. 

Hedges’ g was calculated by subtracting the mean of the comparison
condition from that of the ITS tutoring condition and
dividing the difference by the average of the two groups’ standard 
deviations. A positive g indicates that students tutored by ITS 
achieved more learning gains than did those in the comparison 
condition. In cases for which only inference test results were 
reported but no means and standard deviations were available, g 
was estimated from the inferential statistics, such as t, F, or p 
values (Wilson & Lipsey, 2001). For studies that did not report 
specific values of inferential statistics, we assumed a conservative 
value for effect size calculation. For example, if a study reported 
a statistically significant difference between the ITS and the comparison
condition with p < .01, we assumed a p value of .01 for
effect size calculation.¹
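
The calculation described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: it uses the common pooled-standard-deviation form of the standardized mean difference and the usual approximate small-sample correction factor, 1 - 3/(4df - 1), and it shows the conversion from an independent-samples t statistic only. The function and parameter names are hypothetical, and the software used for the analyses reported here may differ in detail.

```python
import math


def hedges_g(mean_its, mean_cmp, sd_its, sd_cmp, n_its, n_cmp):
    """Standardized mean difference (ITS minus comparison), bias-corrected."""
    df = n_its + n_cmp - 2
    # Assumption: pooled SD, weighting each group's variance by its df.
    sd_pooled = math.sqrt(
        ((n_its - 1) * sd_its ** 2 + (n_cmp - 1) * sd_cmp ** 2) / df
    )
    d = (mean_its - mean_cmp) / sd_pooled
    # Approximate small-sample correction factor.
    correction = 1 - 3 / (4 * df - 1)
    return correction * d


def g_from_t(t, n_its, n_cmp):
    """Estimate g when only an independent-samples t statistic is reported."""
    df = n_its + n_cmp - 2
    d = t * math.sqrt(1 / n_its + 1 / n_cmp)
    return (1 - 3 / (4 * df - 1)) * d
```

A positive return value corresponds to the ITS group outperforming the comparison group, matching the sign convention stated in the text.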

We calculated unadjusted effect sizes for a study if it only 
reported the ITS and comparison groups’ mean posttest scores, 
standard deviation, and sample sizes. Unadjusted effect sizes did 
not take into account other variables that might have had an impact 
on the outcomes. For some studies, in addition to unadjusted effect 
sizes, adjusted effect sizes were also extracted. We called them 
adjusted effect sizes because they were calculated after adjusting or 
controlling for other variables, such as pretest scores. In some 
cases, adjusted effect sizes were based on means and standard 
deviations of gain scores (i.e., posttests - pretests), whereas in
other cases they were based on covariance-adjusted means and 
standard deviations. For studies that reported descriptive statistics 
of both pretests and posttests, as suggested by D. B. Wilson 
(personal communication, April 18, 2011), adjusted effect sizes 
were the differences between posttest and pretest effect sizes and 
their variances were the sum of the posttest and pretest effect size
variances.
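
The pretest-posttest adjustment just described amounts to simple arithmetic, sketched below in Python. The variance formula for g is the commonly used large-sample approximation and is an assumption on our part, as are the function names; the text specifies only that the adjusted effect is the posttest effect minus the pretest effect and that the two variances are summed.

```python
def g_variance(g, n_its, n_cmp):
    """Approximate sampling variance of g (assumed standard formula)."""
    n_total = n_its + n_cmp
    return n_total / (n_its * n_cmp) + g ** 2 / (2 * n_total)


def adjusted_effect(g_post, var_post, g_pre, var_pre):
    """Adjusted effect size and variance from posttest and pretest effects."""
    # Difference of the two effect sizes; variances add.
    return g_post - g_pre, var_post + var_pre
```

For example, with hypothetical values g_post = 0.30 and g_pre = 0.10, the adjusted effect size is 0.20, and its variance is the sum of the two input variances.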


¹ This was the case for only one study (i.e., Shneyderman, 2001), in
which the effect size for one of the three outcomes was calculated by
assuming p = .01 when the study reported p < .01. Because the effect size
representing this study was an average of all three effect sizes from three 
outcomes, there was a minimal possibility that this would lead to an 
underestimation of the overall effect sizes in this meta-analysis. 

Independent Studies, Samples, and Effect Sizes


To address effect size dependency issues, we used independent 
samples as the unit of analysis. An independent sample is not
equivalent to a separate research report. One report could contain
two or more independent studies. For example, we coded Beal et 
al. (2010) as two independent studies, each based on a different 
sample. The 26 reports contained 34 independent studies based on 
34 independent samples. Table 1 presents the major features of all
31 studies in which ITS were compared with regular classroom 
instruction. 

We used a shifting unit of analysis approach (Cooper, 2010) to 
further address possible dependencies among effect sizes. The 
benefit of the shifting unit of analysis approach is that it allows
us to retain as much data as possible while ensuring a reasonable 
degree of independence among effect sizes. With this approach, 
effect sizes were first extracted for each outcome as if they were 
independent. For example, if a study with one independent sample 
used both a standardized test and a course grade to measure 
students’ learning, two separate effect sizes were calculated. When 
estimating the overall average effect of ITS, these two effect sizes 
were averaged so that the sample contributed only one effect size 
to the analysis. However, when conducting a moderator analysis to 
investigate whether the effects of ITS vary as a function of the type 
of outcome measures, this sample contributed one effect size to the 
category of standardized test and one to that of course grade. 
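
The shifting unit of analysis can be sketched in Python as a grouping step whose key changes with the question being asked. The records, sample labels, and function name below are hypothetical and serve only to illustrate the approach described above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (sample_id, outcome_type, effect_size).
effects = [
    ("sample_A", "standardized test", 0.10),
    ("sample_A", "course grade", 0.30),
    ("sample_B", "standardized test", -0.05),
]


def shift_unit(records, by_outcome=False):
    """Average effect sizes within each unit of analysis.

    by_outcome=False: one averaged effect size per independent sample
    (used for the overall estimate).
    by_outcome=True: one effect size per sample within each outcome
    category (used for a moderator analysis of outcome type).
    """
    groups = defaultdict(list)
    for sample_id, outcome_type, g in records:
        key = (sample_id, outcome_type) if by_outcome else sample_id
        groups[key].append(g)
    return {key: mean(values) for key, values in groups.items()}
```

Called with the default setting, sample_A contributes a single averaged effect size; called with by_outcome=True, it contributes one effect size to each outcome category.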


Data Analysis 


We used the Comprehensive Meta-Analysis (Borenstein, 
Hedges, Higgins, & Rothstein, 2006) software for data analysis. 
Before the analyses, we conducted Grubbs (1950) tests to examine 
whether there were statistical outliers among the effect sizes and 
sample sizes. We conducted the meta-analysis using a weighting 
procedure and with both fixed-effect and random-effects models 
(Cooper, 2010). A fixed-effect model functions with the assump- 
tion that there is one true effect in all of the studies included in a 
meta-analysis and the average effect size will be an estimate of that 
value. A fixed-effect model is suited to drawing conditional infer- 
ences about the observed studies. However, it is less well suited to 
making generalizations to the population of studies from which the 
observed studies are a sample (Konstantopoulos & Hedges, 2009). 
A random-effects model assumes that there is more than one true 
effect and the effect sizes included in a meta-analysis are drawn 
from a population of effects that can differ from each other. 
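
The difference between the two models can be made concrete with a minimal Python sketch. It assumes inverse-variance weights and the DerSimonian-Laird moment estimator of between-study variance, which is one common choice; the source does not state which estimator the Comprehensive Meta-Analysis software applied, and the function names here are hypothetical.

```python
def fixed_effect(gs, variances):
    """Inverse-variance weighted mean: assumes one true effect."""
    weights = [1 / v for v in variances]
    return sum(w * g for w, g in zip(weights, gs)) / sum(weights)


def random_effects(gs, variances):
    """Pooled effect and between-study variance (tau squared).

    Assumption: DerSimonian-Laird estimator for tau squared.
    """
    if len(gs) < 2:
        return fixed_effect(gs, variances), 0.0
    weights = [1 / v for v in variances]
    mean_fe = fixed_effect(gs, variances)
    # Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (g - mean_fe) ** 2 for w, g in zip(weights, gs))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(gs) - 1)) / c)
    # Each study's variance is inflated by tau squared before weighting.
    weights_re = [1 / (v + tau2) for v in variances]
    pooled = sum(w * g for w, g in zip(weights_re, gs)) / sum(weights_re)
    return pooled, tau2
```

When the estimated between-study variance is zero, the random-effects weights reduce to the fixed-effect weights and the two pooled estimates coincide.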

Two approaches were used to assess publication bias. First, a 
funnel plot was visually inspected. The suggestion of missing 
studies on the left side of the distribution indicated the possible 
presence of publication bias. Duval and Tweedie’s (2000) trim- 
and-fill procedure (available as part of the Comprehensive Meta- 
Analysis software) was then used to further assess and adjust for 
publication bias. Through this procedure, unmatched observations 
were removed (trimmed) from the data distribution and additional 
values were imputed (filled) for projected missing studies. Then, 
average effect sizes were recalculated.


Moderator Analyses 


Testing for moderators was conducted on the groups of effect 
sizes that had a high degree of heterogeneity (Cooper, Hedges, &
Valentine, 2009). The purpose of testing for moderators was to
identify variables associated with certain features of the primary 
studies that might be significantly associated with the effectiveness 
of ITS. 


Results 


The literature search located 26 reports that met our study 
inclusion criteria. The reports appeared between 1997 and 2010. 
The sample sizes in the reported studies ranged from 18 to 17,164. 
The 26 reports provided 65 effect sizes. Forty-seven effect sizes 
were unadjusted, meaning they were calculated from posttest out- 
come measures and did not control for variables other than the ITS 
treatment, which might have influenced the outcome measures; 18 
were adjusted effect sizes, which were calculated after controlling 
for other confounding variables, such as pretest scores. 

As mentioned in the Method section, to address effect size 
dependency issues, we used independent studies (i.e., samples) as 
the unit of analysis. The 26 reports contained 34 independent 
studies based on 34 independent samples. Of the 34 independent 
studies, 31 compared ITS with regular classroom instruction. These
31 studies provided 61 effect sizes (see Table 1). In general, these
31 independent studies compared learning outcomes of instruction
with an ITS component to those without one. Specifically,
this comparison refers to four types of comparison situations. First,
a large portion of the studies (for example, studies of Cognitive
Tutor) compared the learning gains of students who learned through
instruction in which Cognitive Tutor was a significant part of 
regular classroom instruction with the learning gains of students 
who learned through traditional classroom instruction in which no 
Cognitive Tutor was involved. In such studies, interventions usu- 
ally lasted for one school year or one semester during which 
students in the experimental groups generally spent 60% of their 
time in regular classroom learning and 40% of their time in the 
computer lab using Cognitive Tutor; students in the control groups 
spent 100% of their time in regular classrooms. Second, some 
studies compared students who learned solely through using ITS 
with those who learned in regular classroom instruction (e.g., 
Arroyo, Woolf, Royer, Tai, & English, 2010; Beal et al., 2010, 
Study 2; Walles, 2005). Interventions in these studies usually 
lasted for just a few days. Third, two studies compared students’ 
learning in conditions in which ITS partially took over teachers’
responsibilities (e.g., giving students guidance or feedback) with
students’ learning in conditions in which they received guidance or
feedback from teachers (i.e., Hwang, Tseng, & Hwang, 2008;
Stankov, Rosic, Zitko, & Grubisic, 2008). Interventions in these 
studies lasted for several weeks to one semester. Last, one study 
compared students who used ITS as a supplement in addition to 
regular classroom instruction with students who learned through 
regular classroom instruction without using ITS as a supplement 
(i.e., Biesinger & Crippen, 2008). Intervention in this study lasted 
for one semester. Because the comparison conditions in all four 
types of situations above involved either regular classroom instruc- 
tion or teachers’ efforts, we grouped them together as ITS being 
compared with regular classroom instruction. 

Two independent studies (i.e., Mendicino, Razzaq, & Heffer- 
nan, 2009; Radwan, 1997) provided information on the effects of 
ITS on mathematical learning relative to that of homework assign- 
ments. One independent study (i.e., Beal et al., 2010) compared 
ITS with human tutoring. We narratively reported the results of the
studies that compared ITS with human tutors or homework
conditions later in this section. We did not include them in the 
analyses described below so as to have a single clear comparison 
group. Therefore, the 61 effect sizes of ITS in comparison to 
regular classroom instruction made up the data for the results that 
follow. 

With the 61 effect sizes, we formed three different data sets. The 
first data set included unadjusted overall effect sizes. It consisted 
of 26 effect sizes with each independent sample contributing one 
overall effect size to the data set. Here, if multiple effect sizes were 
extracted from the same sample, these effect sizes were averaged 
to estimate the overall effectiveness of ITS on this independent 
sample. The second data set included all unadjusted effect sizes. 
This data set consisted of all 44 unadjusted effect sizes from the 26 
independent samples. The third data set consisted of 17 adjusted 
overall effect sizes from 17 independent samples. Some indepen- 
dent samples provided both an unadjusted and an adjusted overall 
effect size, whereas some only provided one type of overall effect
size or the other.

We conducted analyses on adjusted and unadjusted effect sizes 
separately. One may argue that it would be beneficial to pool the 
two types of effect sizes so that the analyses would include all of 
the 31 effect sizes from the 31 studies. However, we think the 
benefits of conducting the analyses separately outweigh those of 
analyzing them together. We have three justifications. First, dis- 
tinguishing adjusted and unadjusted effect sizes would allow us to 
examine whether estimates of ITS effectiveness differ with or
without controlling for confounding factors. Second, the number
of studies in the analyses would not increase significantly even if we
analyzed the effect sizes together. Specifically, if the effect sizes
were analyzed together, the total number of studies would be 
increased from 26 (i.e., the number of unadjusted effect sizes) to 
31 (i.e., the total number of independent studies or samples). 
Finally, analyzing the two types of effect sizes separately helps in 
interpretation. Differentiating adjusted and unadjusted effect sizes 
and integrating them separately allows us to provide clearer infor- 
mation regarding what each estimate of effect refers to with regard 
to the achievement outcome. 

We conducted Grubbs (1950) tests to look for statistical outliers 
before calculating the average effect sizes. The Grubbs tests 
showed that, among the unadjusted overall effect sizes (k = 26), 
one effect size (g = -1.57) appeared to be an outlier (i.e., Plano,
Ramey, & Achilles, 2007). We found that the Plano et al. (2007)
study provided information for both an adjusted (adjusted by
pretest scores, g = -0.48) and an unadjusted effect size (g =
-1.57). Clearly then, the unadjusted effect size was strongly
impacted by the preexisting differences between the treatment and
comparison groups. We reset the effect size to -0.66, its next
nearest neighbor among the unadjusted overall effect sizes. Among
all the unadjusted effect sizes (k = 44), the effect size (g =
-1.57) from the Plano et al. (2007) study again appeared to be an
outlier. We reset the effect size to -1.03, its next nearest neighbor.
The Grubbs tests detected no outliers among the adjusted overall 
effect sizes. We also conducted Grubbs tests on ITS and compar- 
ison group sample sizes. Again, we reset the outlier sample sizes 
to their nearest neighbors. We conducted analyses after adjusting 
the outlier sample sizes.” 
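
The outlier screening described in this paragraph can be illustrated with a short sketch. It is not the procedure the authors actually ran: the list of effect sizes is hypothetical (only the values −1.57 and −0.66 appear in the text), the function names are illustrative, and the critical value is an assumed lookup from a Grubbs table for n = 9 at a two-sided alpha of .05.

```python
from statistics import mean, stdev

def grubbs_statistic(values):
    """Return (G, index): G = max |x - mean| / sd, index of the most extreme value."""
    m, s = mean(values), stdev(values)
    idx = max(range(len(values)), key=lambda i: abs(values[i] - m))
    return abs(values[idx] - m) / s, idx

def reset_to_nearest_neighbor(values, idx):
    """Replace values[idx] with the closest of the remaining values."""
    others = values[:idx] + values[idx + 1:]
    nearest = min(others, key=lambda v: abs(v - values[idx]))
    return values[:idx] + [nearest] + values[idx + 1:]

# Hypothetical effect sizes; only -1.57 and -0.66 are taken from the text.
effects = [-1.57, -0.66, -0.20, 0.00, 0.05, 0.10, 0.15, 0.30, 0.45]

# Assumed critical value from a Grubbs table (n = 9, two-sided alpha = .05).
CRITICAL_G = 2.215

g_stat, idx = grubbs_statistic(effects)
if g_stat > CRITICAL_G:
    adjusted = reset_to_nearest_neighbor(effects, idx)
else:
    adjusted = list(effects)
print(round(g_stat, 2), adjusted[idx])
```

Resetting the flagged value to its nearest neighbor keeps the study in the data set while limiting its pull on the weighted average, which is the motivation the paragraph gives for the adjustment.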


As Table 1 shows, 10 different ITS were studied in the 31 
independent studies comparing ITS with regular classroom in- 
struction. Cognitive Tutor by Carnegie Learning was the most 
frequently studied. Specifically, Cognitive Tutor for algebra learn- 
ing was evaluated in 16 studies; Cognitive Tutor for math was 
studied in three studies; in one study, Cognitive Tutor was used for 
geometry. As mentioned in the introduction, the WWC had com- 
pleted four reviews on the effectiveness of Cognitive Tutor. Four 
other ITS (i.e., Larson Pre-Algebra/Algebra I, Achieve Now,
iLearn Math, and Plato Algebra) were evaluated in a national-level 
study (see Campuzano et al., 2009; Dynarski et al., 2007) and were 
also reviewed by the WWC. The other two ITS that were relatively 
frequently studied were Wayang Outpost (see Arroyo et al., 2010; 
Beal et al., 2010; Walles, 2005) and Tutor-Expert System (see the 
three studies by Stankov, Rosic, Zitko, & Grubisic, 2008). Online 
Remediation Software appeared in two studies by Biesinger and 
Crippen (2008). AnimalWatch (Beal, Arroyo, Cohen, & Woolf, 
2007) and Intelligent Tutoring, Evaluation and Diagnosis (Hwang 
et al., 2008) each appeared once. 

Because Cognitive Tutor was most frequently studied, we 
briefly describe its mechanism, scope of use, and the length of its 
implementation. Cognitive Tutor is built on a cognitive theory 
called adaptive control of thought (Anderson et al., 1995). Cogni- 
tive Tutor presents students with a series of problems and adap- 
tively identifies a student’s problem-solving strategy through his 
or her actions and comparisons with correct solution approaches 
and misconceptions generated by the program’s cognitive model, 
a process called model tracing. Five curricula have been developed 
with Cognitive Tutor as their software component and have been 
used by more than 500,000 students in approximately 2,600 
schools across the United States as of 2010 (WWC, 2010a). They 
are Bridge to Algebra, Algebra I, Geometry, Algebra II, and 
Integrated Math. In these curricula, students generally spend three 
class periods per week in regular classroom learning and two class 
periods in computer lab using Cognitive Tutor. In most of the 
evaluation studies included in this meta-analysis, students used 
Cognitive Tutor for one school year or one semester. 


Overall Effectiveness of ITS on Students’ 
Mathematical Learning 


We conducted meta-analyses on the data sets of unadjusted and 
adjusted overall effect sizes to examine the overall effectiveness of 
ITS on students’ mathematical learning, compared with that of 
regular classroom instruction. All effect sizes were weighted by 
inverse variances. Of the 26 unadjusted overall effect sizes, 17 
were in a positive direction, eight were in a negative direction, and 
one was exactly 0. Under a fixed-effect model, the average effect 
size was 0.05, 95% CI [.02, .09], p = .005, and was significantly 
different from 0. Under a random-effects model, the average effect 
size was 0.09, 95% CI [−.03, .20], p = .136, and was not
significantly different from 0. There was a high degree of heterogeneity


* It is worth mentioning that we also calculated the average effect sizes
and ran moderator analyses on the effect sizes without adjusting the outlier
sample sizes. We found very minor differences in the analysis results
between those with and without adjusted outlier sample sizes. These
differences were not sufficient to lead to any major changes in conclusions.
Therefore, we chose to report only the analysis results with the sample
size outliers adjusted.
among the 26 unadjusted overall effect sizes, Q_t(25) = 180.80,
p = .000. This indicates that it was unlikely that sampling error
alone was responsible for the variance among the effect sizes;
instead, some other factors likely played a role in creating
variability as well.
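
To make the two pooling models and the heterogeneity statistic concrete, the following sketch computes an inverse-variance fixed-effect mean, the Q statistic, and a random-effects mean. The effect sizes and variances are hypothetical, and the DerSimonian-Laird estimator of the between-study variance is an assumption made for illustration; the text does not state which estimator was used.

```python
import math

def fixed_effect(gs, variances):
    """Inverse-variance weighted mean, its standard error, and the Q statistic."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * g for w, g in zip(weights, gs)) / total
    q = sum(w * (g - pooled) ** 2 for w, g in zip(weights, gs))
    return pooled, math.sqrt(1.0 / total), q

def random_effects(gs, variances):
    """Random-effects mean using the DerSimonian-Laird tau^2 (an assumed estimator)."""
    _, _, q = fixed_effect(gs, variances)
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - (len(gs) - 1)) / c)
    star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * g for w, g in zip(star, gs)) / sum(star)
    return pooled, math.sqrt(1.0 / sum(star)), tau2

# Hypothetical effect sizes (g) and sampling variances for four studies.
gs = [0.30, 0.10, -0.20, 0.05]
vs = [0.04, 0.01, 0.02, 0.005]

mean_f, se_f, q = fixed_effect(gs, vs)
mean_r, se_r, tau2 = random_effects(gs, vs)
print(round(mean_f, 3), round(mean_r, 3), round(q, 2), round(tau2, 4))
```

When Q exceeds its degrees of freedom, the estimated between-study variance is positive, the random-effects weights become more equal, and the confidence interval widens. This is the pattern seen in the reported results, where the random-effects intervals are wider than the fixed-effect ones.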

Of the 17 adjusted overall effect sizes, 10 were in a positive
direction and seven were in a negative direction. Under a fixed-effect
model, the average effect size was 0.01, 95% CI [−.04, .06],
p = .792, and was not significantly different from 0. Under a
random-effects model, the average effect size also was 0.01, 95%
CI [−.10, .12], p = .829, and was also not significantly
different from 0. There was a high degree of heterogeneity
among the 17 adjusted overall effect sizes, Q_t(16) = 54.01, p =
.000. Again, this suggested that it was unlikely that sampling error
alone was responsible for the total variance among the effect sizes.


Examining Publication Bias 


We conducted Duval and Tweedie’s (2000) trim and fill proce- 
dure to assess the possible effects of publication bias. For unad- 
justed overall effect sizes, there was evidence that three studies on 
the left side of the distribution might have been missing under both 
a fixed-effect model and a random-effects model. The overall 
average effect size after imputing the three additional values was 
0.04 under a fixed-effect model and was 0.03 under a random- 
effects model. The average effect size for the observed effect sizes, 
as reported previously, was 0.05 under a fixed-effect model and 
0.09 under a random-effects model. This implies that the average 
effect of ITS might have been slightly overestimated. 

For adjusted overall effect sizes, three studies on the left side of
the distribution were projected as missing under a fixed-effect
model, and six effect sizes on the left side of the mean were
projected as missing under a random-effects model. The overall
average effect sizes after imputing the three additional values
ranged from −0.04 to −0.01 using a fixed-effect model; the
overall average effect size after imputing the six additional values
was −0.09 using a random-effects model. The average of the
observed effect sizes, as reported previously, was 0.01 under both
a fixed-effect model and a random-effects model. Therefore, there
was little evidence that publication bias had much impact on the
average effect size in this case.


Testing for Moderators on the Unadjusted and 
Adjusted Overall Effect Sizes 


We conducted moderator analyses exploring nine variables that 
could possibly have an impact on ITS’s effects. We chose these 
nine variables for two reasons. First, they represented important 
features of ITS intervention or research methodology. Second, 
there were at least two effect sizes associated with each
category of the variable, in the data sets of both unadjusted and
adjusted overall effect sizes, to allow meaningful analyses.*

Tables 2 and 3 present the results of testing for moderators on 
the unadjusted and adjusted overall effect sizes, respectively. In 
each data set, the number of effect sizes involved might be differ- 
ent. For variables with more than two categories, we first con- 
ducted comparisons using all of the categories and then regrouped 
them to create a two-group comparison. We included the results of 
further analyses on the two-group comparison in the tables as well. 


Subject matter. Testing results showed that the effectiveness
of ITS did not differ for different subject matters under a fixed-effect
model, Q_b(1) = .12, p = .726, nor did it differ under a
random-effects model, Q_b(1) = .62, p = .431, for unadjusted
effect sizes. The advantage of using ITS, compared with regular
classroom instruction, was significant only for basic math under
the fixed-effect model, indicated by the fact that the confidence
interval of the effect size (g = .06) did not contain 0.

For adjusted effect sizes, results showed the effectiveness of ITS
on students’ learning of basic math appeared to be greater than that
of learning algebra under a fixed-effect model, Q_b(1) = 9.10, p =
.003, but not under a random-effects model, Q_b(1) = 1.62, p =
.204.* Specifically, under a fixed-effect model, the average
effectiveness of ITS was g = 0.11, 95% CI [.04, .19], on helping
students learn basic math and g = −0.05, 95% CI [−.13, .02], on
learning algebra.
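
The between-group heterogeneity statistic for this comparison can be approximated from the two reported subgroup means and their confidence intervals. The sketch below is an illustration under stated assumptions: standard errors are back-calculated from the rounded 95% intervals using a normal approximation, and the function names are illustrative. Because the inputs are rounded, the result (about 8.7) only approximates the reported value of 9.10.

```python
Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def se_from_ci(lower, upper, z=Z_95):
    """Back-calculate a standard error from a symmetric confidence interval."""
    return (upper - lower) / (2.0 * z)

def q_between(subgroups):
    """Fixed-effect between-group Q from a list of (mean, standard error) pairs."""
    weights = [1.0 / se ** 2 for _, se in subgroups]
    grand = sum(w * g for w, (g, _) in zip(weights, subgroups)) / sum(weights)
    return sum(w * (g - grand) ** 2 for w, (g, _) in zip(weights, subgroups))

# Adjusted fixed-effect subgroup results as reported in the text.
basic_math = (0.11, se_from_ci(0.04, 0.19))
algebra = (-0.05, se_from_ci(-0.13, 0.02))

qb = q_between([basic_math, algebra])
print(round(qb, 2))
```

With two subgroups the statistic has one degree of freedom, so a value near 9 corresponds to a p value well below .01, in line with the significant fixed-effect difference described above.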

ITS duration. For unadjusted effect sizes, the effectiveness of
ITS differed depending on the length of instruction under both a
fixed-effect model, Q_b(2) = 16.28, p = .000, and a random-effects
model, Q_b(2) = 6.42, p = .04. Further analyses revealed no
difference between the short-term and one-semester ITS interventions.
We therefore compared the combination of short-term and
one-semester interventions with interventions of one school year
or longer. We found that under a fixed-effect model, the average
effectiveness of ITS was greater when the interventions lasted for
less than one school year, g = 0.23, 95% CI [.13, .32], than when
they lasted for one school year or longer, g = .02, 95% CI [−.02,
.06]; under a random-effects model, the average effectiveness of
ITS was also greater when the interventions lasted for less than one
school year, g = 0.26, 95% CI [.08, .44], than when they
lasted for one school year or longer, g = −.01, 95% CI [−.15,
.14].

For adjusted effect sizes, results showed that ITS effectiveness
also differed depending on the duration of intervention under both
a fixed-effect model, Q_b(2) = 13.88, p = .001, and a random-effects
model, Q_b(2) = 14.71, p = .001. Further analyses revealed
that the effects differed depending on whether the ITS intervention
lasted for one school year (or longer) or less than one school year under
both a fixed-effect model, Q_b(1) = 6.48, p = .011, and a random-effects
model, Q_b(1) = 7.40, p = .007.

Sample achievement level. We categorized study samples in
terms of the academic achievement level of the subjects, following
the way the samples were described in the primary studies.
Two types of student samples appeared. One consists of
general students, a population that includes students of all achieve- 
ment levels. Another consists of low achievers. There were three 
studies that reported results for low achievers (i.e., Biesinger & 
Crippen, 2008; Plano et al., 2007; Smith, 2001). For unadjusted 
effect sizes, under a fixed-effect model, the effectiveness of ITS on 


* The second reason led us to drop a number of variables we initially
intended to study. For example, we hoped to compare whether there was a 
difference in the effects of ITS when they were used to substitute for 
regular classroom instruction and when they were used only as a supple- 
ment to regular classroom instruction. We were unable to do so because, 
for the 17 adjusted effect sizes, only one effect size was associated with 
ITS used as substitute, versus 16 effect sizes associated with ITS as a 
supplement. 

* For adjusted overall effect sizes, we dropped one effect size associated 
with geometry, g = −.19, 95% CI [−.34, −.04].
Table 2 (continued)

                                               Fixed                                  Random
Variable                                  k    g      95% CI         Q_b       p      g      95% CI         Q_b     p

Measurement timing (further analysis)                                17.84***  .000                         3.02    .082
   End of school year                          .01    [−.03, .04]                     .001   [−.14, .15]
   Sooner than end of school year              .25    [.14, .35]                      .26    [.01, .51]
Outcome type^a
   Course grades
   Course passing rates
   Specifically designed tests
   Modified standardized tests
   Standardized tests
Outcome type (further analysis)                                      13.19***  .000                         2.09    .148
   Course-related outcome measures        16   .19    [.11, .27]                      .20    [.01, .39]
   Standardized test measures             20   .02    [−.02, .06]                     .03

Note. CI = confidence interval; Q_b denotes the heterogeneity status between all subcategories of a particular variable under testing, with degrees of freedom equal to moderator levels minus one; ITS = intelligent tutoring system.
^a Testing for moderators was conducted on all the unadjusted effect sizes, so the total number of k exceeded 26.
* p < .05. ** p < .01. *** p < .001.




helping general students learn mathematical subjects, g = .09,
95% CI [.05, .12], was greater than on helping low achievers, g =
−.4, 95% CI [−.55, −.28], Q_b(1) = 46.13, p = .000. The
difference was not significant under a random-effects model,
Q_b(1) = 0.60, p = .438. Overall, ITS appeared to have a positive
impact on general students. For low achievers, the average effect
was negative under both fixed-effect and random-effects models.

For adjusted effect sizes, the effects of ITS on helping general
students learn mathematical subjects, g = .04, 95% CI [−.02, .09],
were greater than on helping low achievers, g = −.18, 95% CI
[−.32, −.05], Q_b(1) = 9.24, p = .002, under a fixed-effect model.
The difference was not significant under a random-effects model,
Q_b(1) = 1.31, p = .253. In this analysis, the only average effect
size that was significantly different from 0 was the negative effect
for low achievers under a fixed-effect model, indicating that regular
classroom instruction compared favorably with ITS for this group.

Schooling level. The unadjusted overall effect sizes were
associated with samples of three schooling levels: (a) elementary
school, which included K-5 grade levels; (b) middle school, which
included Grades 6-8; and (c) high school, which included Grades
9-12.° The relative effectiveness of ITS on students’ mathematical
learning did not vary significantly in terms of schooling level
under either a fixed-effect model, Q_b(2) = 2.11, p = .349, or a
random-effects model, Q_b(2) = 0.58, p = .749. We regrouped the
effect sizes into elementary school and secondary school levels.
Again, results showed that the difference was not significant under
either a fixed-effect model, Q_b(1) = 0.51, p = .473, or a random-effects
model, Q_b(1) = 0.37, p = .545.

For the adjusted overall effect sizes, the effectiveness of ITS
varied significantly in terms of schooling level under a fixed-effect
model, Q_b(2) = 14.29, p = .001, but not under a random-effects
model, Q_b(2) = 3.07, p = .215. The average effect sizes suggested
that the effects of ITS might be most pronounced for students in
elementary school, g = .41, 95% CI [−.01, .84], compared with
those in middle school, g = .09, 95% CI [.01, .17], and in high
school, g = −.09, 95% CI [−.17, −.02]. However, when the
effect sizes were regrouped into elementary and secondary school
levels, no statistically significant difference was found between
them under either a fixed-effect model, Q_b(1) = 3.61, p = .057, or
a random-effects model, Q_b(1) = 3.19, p = .074.

Sample size. Among the 26 unadjusted overall effect sizes,
nine were associated with sample sizes less than 200, 12 were
associated with sample sizes over 200 but less than 1,000, and five
were associated with sample sizes over 1,000. The unadjusted
effectiveness of ITS corresponding to each of these three sample
size categories varied significantly under a fixed-effect model,
Q_b(2) = 14.28, p = .001, but not under a random-effects model,
Q_b(2) = 0.70, p = .704. Further analyses revealed that the effects
were greater when the study sample sizes were less than 200 than
when the sample sizes were over 200 under a fixed-effect model,
Q_b(1) = 5.94, p = .015, but not under a random-effects model,
Q_b(1) = 0.74, p = .389.
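
The p values attached to the between-group Q statistics follow from a chi-square distribution whose degrees of freedom equal the number of subgroups minus one, as the note to Table 2 states. The sketch below checks the values in this paragraph using the closed-form upper-tail probabilities for one and two degrees of freedom. The function name is illustrative, and the check assumes the reported Q values are rounded to two decimals.

```python
import math

def chi_square_upper_tail(q, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(q / 2.0))
    if df == 2:
        return math.exp(-q / 2.0)
    raise ValueError("closed forms are provided only for df = 1 and df = 2")

# Three sample size categories: two degrees of freedom.
p_three_groups_fixed = chi_square_upper_tail(14.28, 2)
p_three_groups_random = chi_square_upper_tail(0.70, 2)

# Two groups (under 200 vs. over 200): one degree of freedom.
p_two_groups_fixed = chi_square_upper_tail(5.94, 1)
p_two_groups_random = chi_square_upper_tail(0.74, 1)

print(round(p_three_groups_fixed, 3), round(p_two_groups_fixed, 3))
```

A two-group comparison has a single degree of freedom, and with one degree of freedom the statistics 5.94 and 0.74 reproduce the reported p values of .015 and .389 to within rounding.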


° We did not include three unadjusted and two adjusted effect sizes
associated with studies in which samples spanned both middle school
and high school. It is also worth noting that only three studies
involved elementary school students and that all of them were
conducted by the same research team.
Table 3

Testing for Moderators of the Adjusted Overall Effect Sizes

                                               Fixed                                  Random
Variable                                  k    g      95% CI         Q_b       p      g      95% CI         Q_b     p

Subject                                                              9.10**    .003                         1.62    .204
   Basic math                             9    .11    [.04, .19]
   Algebra                                7    −.05   [−.13, .02]

ITS duration                                                         13.88**   .001                         14.71** .001
   One school year or longer^a            10   −.02   [−.07, .04]                     −.08   [−.20, .05]
   One semester                           2    .06    [−.13, .24]                     .06    [−.13, .24]
   Short term                             5           [.24, .79]                             [.24, .79]

ITS duration (further analysis)                                      6.48*     .011                         7.40**  .007
   One school year or longer              10   −.02   [−.07, .04]                     −.08   [−.20, .05]
   Less than one school year              7    .19    [.04, .34]                      .29    [.06, .53]

Sample achievement level                                             9.24**    .002                         1.31    .253
   General students                       14   .04    [−.02, .09]
   Low achievers                          3    −.18   [−.32, −.05]                    −.16   [−.49, .18]

Schooling level                                                      14.29**   .001                         3.07    .215
   Elementary school                      3    .41    [−.01, .84]                     .42    [−.04, .89]
   Middle school                          4    .09    [.01, .17]                      −.004  [−.20, .19]
   High school                            8    −.09   [−.17, −.02]                           [−.19, .16]

Schooling level (further analysis)                                   3.61      .057                         3.19    .074
   Elementary school                      3    .41    [−.01, .84]                     .42    [−.04, .89]
   Secondary school                       14   .001

Sample size                                                          12.49**   .002                         2.77    .251
   Less than 200                          6    .18    [−.06, .43]                     .24
   Over 200 but less than 1,000           8    −.08                                   −.06   [−.20, .09]
   Over 1,000                             3    .09    [.01, .16]                      .06

Sample size (further analysis)                                       2.03      .154                         1.96    .162
   Less than 200                          6    .18    [−.06, .43]                     .24
   Over 200                               11   −.001  [−.05, .05]                     −.02   [−.14, .09]

Research design                                                      .92       .338                         1.09    .296
   Quasi-experimental                     9    −.06
   True experimental                      8    .02    [−.04, .07]                     −.01

Year of data collection                                              2.31      .315                         .715    .699
   Before 2003                            3    −.08                                   −.08
   Between 2003-2005                      4    .03    [−.04, .10]                     −.07
   Between 2006-2010                      8    −.03   [−.12, .05]                     .00    [−.13, .14]

Year of data collection (further analysis)                           .19                                            .468
   Before 2006                            7    .01                                    −.08   [−.26, .10]
   2006 and after                         8    −.03                                   .004

Counterbalanced testing                                              8.89**    .003                         .44     .507
   No                                     12   −.08   [−.16, −.004]
   Yes                                    5    .08    [.01, .14]                      .06    [−.09, .21]

Report type                                                          .69       .407                         1.98    .160
   Peer-reviewed journal                  6    −.04                                   .23
   Nonjournal                             11   .02    [−.04, .07]                     −.08

Note. CI = confidence interval; Q_b denotes the heterogeneity status between all subcategories of a particular variable under testing; ITS = intelligent tutoring system.
^a This subcategory included one study in which the ITS intervention lasted more than one school year.
* p < .05. ** p < .01.




The adjusted effectiveness of ITS associated with each of these
three sample size categories varied significantly under a fixed-effect
model, Q_b(2) = 12.49, p = .002, but not under a random-effects
model, Q_b(2) = 2.77, p = .251. Further analyses showed
that the effects did not differ significantly between studies with
sample sizes of less than 200 and those with sample sizes over 200
under a fixed-effect model, Q_b(1) = 2.03, p = .154, nor did they
under a random-effects model, Q_b(1) = 1.96, p = .162.

Research design. For unadjusted overall effect sizes, 15 were
from quasi-experimental studies and 11 were from true experiments.
The average of the effect sizes from the quasi-experiments,
g = .09, 95% CI [.05, .14], was larger than that from true
experiments, g = −.01, 95% CI [−.07, .05], under a fixed-effect
model, Q_b(1) = 7.97, p = .005. Only the average effect for studies
using quasi-experimental designs was significantly different from
0. The difference was not significant under a random-effects
model, Q_b(1) = 0.37, p = .544.

The average of adjusted overall effect sizes from quasi-experiments
and true experiments did not differ under either a
fixed-effect model, Q_b(1) = 0.92, p = .338, or a random-effects
model, Q_b(1) = 1.09, p = .296. None of the average effects
were significantly different from 0.

Year of data collection. For unadjusted overall effect sizes,
the effects varied depending on the year in which the data were
collected under a fixed-effect model, Q_b(2) = 10.51, p = .005, but
not under a random-effects model, Q_b(2) = 3.01, p = .222. Only
the average effect for studies conducted before 2003 (showing a
positive ITS effect) appeared significantly different from 0 under
a fixed-effect model. For adjusted overall effect sizes, the effects
did not differ significantly in terms of data collection time under
either a fixed-effect model, Q_b(2) = 2.31, p = .315, or a random-effects
model, Q_b(2) = 0.715, p = .699.

Counterbalanced testing. For unadjusted overall effect sizes,
the impact of ITS on students’ mathematical learning appeared to
be lower in studies with counterbalanced testing, g = −.05, 95%
CI [−.12, .02], than in studies without counterbalanced testing,
g = .09, 95% CI [.05, .13], under a fixed-effect model, Q_b(1) =
10.43, p = .001. The average effect size from counterbalanced
studies did not differ from 0, whereas the average effect from
studies not using counterbalancing did. The difference was not
significant under a random-effects model, Q_b(1) = 1.09, p = .297.

For adjusted overall effect sizes, the impact of ITS appeared to
be significantly larger in studies with counterbalanced testing, g =
.08, 95% CI [.01, .14], than in studies without counterbalanced
testing, g = −.08, 95% CI [−.16, −.004], under a fixed-effect
model, Q_b(1) = 8.89, p = .003. The difference was not statistically
significant under a random-effects model, Q_b(1) = 0.44, p = .507.

Report type. We grouped the reports into two categories:
peer-reviewed journal reports and nonjournal reports. Nonjournal
reports include government reports, conference papers, private
reports, master’s theses, and doctoral dissertations. For unadjusted
overall effect sizes, the average effect size in peer-reviewed journals,
g = .28, 95% CI [.18, .37], was higher than that of nonjournal
reports, g = .02, 95% CI [−.02, .05], under a fixed-effect model,
Q_b(1) = 24.45, p = .000. The average effect of ITS was positive
in studies in peer-reviewed journals. Under a random-effects
model, the average effect size in peer-reviewed journals, g = .30,
95% CI [.17, .43], was also statistically significantly higher than
that of nonjournal reports, g = −.01, 95% CI [−.15, .13], Q_b(1) =
10.03, p = .002; again, the average effect of ITS was positive in
studies in peer-reviewed journals. For adjusted overall effect sizes,
the average effect size in peer-reviewed journals was not different
from that of nonjournal reports under a fixed-effect model,
Q_b(1) = 0.69, p = .407, nor was this the case under a random-effects
model, Q_b(1) = 1.98, p = .160.


Testing for Moderators on All Unadjusted Effect Sizes 


The data set of all unadjusted effect sizes consisted of all of the 
44 unadjusted effect sizes, not averaged within independent sam- 
ples. This data set allowed us to study two moderators: the mea- 
surement timing and the types of outcomes. The analysis results 
are also included in Table 2. 

Measurement timing. Within a single independent sample,
we averaged the effect sizes that related to the same measurement
timing. This reduced the 44 unadjusted effect sizes to 27 unadjusted
effect sizes that were associated with outcomes measured either
at the end of the school year or sooner than the end
of the school year. The effectiveness of ITS when measured at the
end of the school year, g = .01, 95% CI [−.03, .04], was lower than
that when measured sooner than that, g = .25, 95% CI [.14, .35],
under a fixed-effect model, Q_b(1) = 17.84, p = .000, but not under
a random-effects model, Q_b(1) = 3.02, p = .082.
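
The within-sample averaging step described in this paragraph can be sketched as a simple grouping operation. The records below are hypothetical and the function name is illustrative; the sketch shows only how several effect sizes from one sample collapse to a single value per measurement timing, not the authors’ actual data handling.

```python
from collections import defaultdict
from statistics import mean

def average_within_sample(records):
    """Average effect sizes that share the same (sample, category) key."""
    grouped = defaultdict(list)
    for sample_id, category, g in records:
        grouped[(sample_id, category)].append(g)
    return {key: mean(values) for key, values in grouped.items()}

# Hypothetical records: (sample id, measurement timing, effect size g).
records = [
    ("A", "end_of_year", 0.10),
    ("A", "end_of_year", 0.30),
    ("A", "sooner", 0.40),
    ("B", "sooner", 0.25),
]

averaged = average_within_sample(records)
print(len(records), "->", len(averaged))
```

Averaging within a sample before pooling keeps each independent sample from contributing more than one effect size per category, which is why the 44 unadjusted effect sizes shrink to a smaller set for each moderator.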


Outcome type. As above, we averaged the effect sizes
corresponding to the same outcome type within each independent
sample. This resulted in 36 effect sizes in our analyses. Five
different types of outcomes appeared in the studies. They are (a)
course grades, (b) course passing rates, (c) scores from tests
developed by teachers or researchers to specifically measure
students’ learning on the knowledge content that was covered by
interventions, (d) scores from modified standardized tests that
were either substrands of standardized tests or tests made up of
some of the released standardized test questions, and (e) scores
from standardized tests. Our preliminary analyses showed that
there was no statistically significant difference between the
average effect size associated with course grades or course passing
rates and that of specifically designed tests, nor was there a statistically
significant difference between the average effect size associated
with modified standardized tests and that of standardized tests.
Thus, we grouped the first three outcome types into course-related
measures and the last two outcome types into measures from
standardized tests. Results show that under a fixed-effect model,
the average effect size for course-related measures was g = 0.19,
95% CI [.11, .27], and g = 0.02, 95% CI [−.02, .06] for measures
from standardized tests, Q_b(1) = 13.19, p = .000. The difference
was not significant under a random-effects model, Q_b(1) = 2.09,
p = .148. Course-related measures showed a larger and positive
ITS effect and were significantly different from 0 under both
models.


Effectiveness of ITS in Comparison to Other 
Treatment Conditions 


In comparison to homework. Mendicino et al. (2009) compared
28 fifth-grade students learning math in two different homework
conditions over a period of 1 week. In one condition, students
completed paper-and-pencil homework. In another
condition, students completed Web-based homework using the 
ASSISTment system. ASSISTment is a Web-based homework 
system that facilitates students’ learning by providing scaffolds 
and hints. A number sense problem set and a mixed-problem test 
were used to measure students’ learning after each homework 
condition. To reduce the possibility that other factors might impact 
learning, Mendicino et al. implemented counterbalanced proce- 
dures so that all students participated in both paper-and-pencil and 
Web-based conditions. They were tested both before and after the 
intervention. Mendicino et al. reported an adjusted effect size of 
0.61 favoring ITS and concluded that students learned significantly 
more with the help of the Web-based system than by working on 
paper-and-pencil homework. 

Radwan (1997) compared the math performance of 52 fourth
graders. Half were tutored using the Intelligent Tutoring System
Model and the other half received no tutoring but worked on
completing homework, both during the fifth period of school days.
The experiment lasted for a total of 15 hr, 40 min every day for 4
weeks. Students’ learning was measured through a pretest and
posttest of the Computerized Achievement Tests. Radwan’s t tests
on the test scores concluded that the experimental group performed
significantly better than the control group did. On the basis of the
overall score on the Computerized Achievement Tests, we found
that this study yielded an unadjusted g = .40, SE = .28, and an
adjusted g = .60, SE = .28.
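
For a single study, an effect size and its standard error translate into a confidence interval as g plus or minus a normal quantile times SE. The sketch below applies this to the two values reported for this study. It assumes a normal approximation and is offered only as an illustration; the function name is illustrative and the intervals are not taken from the source.

```python
Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def confidence_interval(g, se, z=Z_95):
    """Return the (lower, upper) bounds of g +/- z * se."""
    half_width = z * se
    return g - half_width, g + half_width

unadjusted_ci = confidence_interval(0.40, 0.28)
adjusted_ci = confidence_interval(0.60, 0.28)

print([round(b, 2) for b in unadjusted_ci], [round(b, 2) for b in adjusted_ci])
```

Under this approximation the unadjusted interval runs from roughly −.15 to .95 and the adjusted interval from roughly .05 to 1.15, which shows how wide the uncertainty around a single small study can be.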




In comparison to human tutoring. Beal et al. (2010) studied
the effectiveness of AnimalWatch, an intelligent tutoring system
designed to help students learn basic computation and fraction
skills to enhance problem-solving performance. The participants
were sixth graders enrolled in a summer academic skills class in
Los Angeles, California. Once per week for 4 weeks, 13 sixth
graders spent 1 hr with math tutors and then 45 min with AnimalWatch,
and 12 sixth graders learned math with their tutors (each
tutor helped four to six students) using small group activities
including blackboard lessons and worksheet practice. The mean
proportion of correct scores was used to measure students’
performance. Beal et al. concluded that students who spent half of their
time using ITS and half of their time with human tutors improved as
much as did those who spent the entire time learning with a human
tutor. We found that this study yielded an unadjusted g = .20,
SE = .39.


Discussion 


Summary of the Evidence 


Findings of this meta-analysis suggest that, overall, ITS had no 
negative and perhaps a very small positive effect on K-12 stu- 
dents’ mathematical learning relative to regular classroom instruc- 
tion. When the effectiveness was measured by posttest outcomes 
and without taking into account the potential influence of other 
factors, the average unadjusted effect size was .05 under a fixed- 
effect model and .09 under a random-effects model favoring ITS 
over regular classroom instruction. After controlling for the 
influence of other variables (e.g., pretest scores), the average 
adjusted effect size was .01 under both a fixed-effect model and 
random-effects model also favoring the ITS condition. However, 
the average relative effectiveness of ITS did not appear to be 
significantly different from 0 except when effect sizes were unad- 
justed and a fixed-effect analysis model was used. Also, whether 
controlling for other factors or not, there was a high degree of 
heterogeneity among the effect sizes. 

Very few studies compared ITS with homework or human 
tutoring. The few existing studies showed that when compared 


Table 4 


with homework or human tutoring, the relative effectiveness of 
ITS appeared to be small to modest, with effect sizes ranging from 
.20 to .60. 

Testing for moderators yielded some informative findings. Ta- 
ble 4 presents a summary of the findings from moderator analyses 
using two different estimates of effect (i.e., unadjusted and ad- 
justed effect sizes) and two analysis models (i.e., fixed-effect and 
random-effects models). Two findings were relatively robust. 
First, the effects appeared to be greater when the ITS intervention 
lasted for less than a school year than when it lasted for one school 
year or longer. This effect appeared regardless of whether the 
moderator analyses were conducted on unadjusted or adjusted 
effect sizes with a fixed-effect or random-effects model. Second, 
the effects of ITS appeared to be greater when the study samples 
were general students than when the samples were low achievers. 
And under a fixed-effect model, this difference was statistically 
significant regardless of whether the analyses were conducted on 
unadjusted or adjusted effect sizes. 

Also, there was some evidence for the following three findings 
related to the methodology of the study: (a) The effectiveness of 
ITS appeared to be largest when the learning outcomes were 
measured before the end of the school year, (b) the effects of ITS 
appeared to be greater when measured by course-related outcomes 
than when measured by standardized tests, and (c) the average 
effect size of studies with smaller sample sizes appeared to be 
bigger than that of larger sample sizes. In general, these results are 
consistent with the findings related to methodological characteris- 
tics of primary studies in numerous meta-analyses. 


Overall Effectiveness of ITS 


The conclusion that ITS had no negative and perhaps a very 
small positive effect on K-12 students’ mathematical learning 
relative to regular classroom instruction is largely congruent with 
the WWC’s conclusions regarding the effects of math educational 
software programs (WWC, 2004, 2010a, 2010b). Specifically, the 
WWC (2010a) concluded that Carnegie Learning Curricula and 
Cognitive Tutor software had no discernible effects on mathemat- 
ics achievement for high school students but Cognitive Tutor® 


Table 4
Findings From Testing for Moderators Across Two Types of Effect Sizes and Two Analysis Models

                                                           Unadjusted       Adjusted
Variable                  ITS favored for                  Fixed   Random   Fixed   Random
Subject                   Basic math                       Yes     Yes      Yes?    Yes
ITS duration              Less than one school year        Yes?    Yes+     Yes+    Yes?
Sample achievement level  General students                 Yes+    Yes      Yes+    Yes?
Schooling level           Elementary school                ?       Yes      Yes     Yes
Sample size               Sample size less than 200        ?       Yes      Yes     Yes
Research design           Quasi-experiments                ?       ?        No      Yes
Year of data collection   Before 2006                      Yes     No       ?       No
Counterbalanced testing   No                               ?       Yes      No+     No
Report type               Peer-reviewed journal            Yes?    ?        No      Yes
Measurement timing        Sooner than end of school year   ?       ?
Outcome type              Course-related outcome measures  Yes?    Yes

Note. Yes denotes that the subcategory, for example, basic math, appears to be favored over the other subcategory or subcategories of that variable (i.e.,
subject). A + denotes that the effects of the intelligent tutoring system (ITS) on the favored feature (i.e., subcategory), for example, basic math, were
statistically significantly greater than those on the other feature (i.e., subcategory), such as algebra.
[A trailing ? marks a cell whose reading is uncertain in the source, and ? alone marks a cell that could not be recovered. For the last two rows, only two cells survive and their column placement is uncertain.]




Algebra I had potentially positive effects on ninth graders’ math 
achievement. The WWC (2004) found that students who used 
Cognitive Tutor earned significantly higher scores on the Educa- 
tional Testing Service Algebra I test and on their end-of-semester 
grades than their counterparts who were taught with traditional 
instruction. Furthermore, the WWC (2010b) concluded that 
PLATO Achieve Now had no discernible effects on sixth graders’
math achievement but the WWC considered the extent of evidence 
to be small. 

It is relevant to mention that the WWC conclusions were based 
on a very limited number of studies that met their evidence 
standards or met their standards with reservation. The present 
meta-analysis included all seven reports that had been identified as 
meeting the WWC’s evidence standards or meeting their standards 
with reservation. This meta-analysis covered studies of many other 
ITS programs in addition to the two reviewed by the WWC. Thus, 
despite the differences in review scopes and methodology, the 
finding that ITS appeared to have no negative and perhaps a small 
positive effect on students’ mathematical achievement is largely 
consistent with the conclusion from the WWC reviews. 

Comparing the present meta-analysis with a recent meta-review 
by VanLehn (2011) illuminates an interplay of many issues per- 
taining to the effectiveness of ITS. VanLehn (2011) reviewed 
randomized experiments that compared the effectiveness of human 
tutoring, computer tutoring, and no tutoring. He found that the
effect size of ITS was d = 0.76, nearly as large as that of
human tutoring, d = 0.79. It appears that VanLehn (2011) found
a larger effect than what the present meta-analysis revealed. How- 
ever, we found that these two systematic reviews are different in at 
least three ways. First, the two reviews differ in subject domains 
and grade levels of students. The VanLehn (2011) review included 
studies of science, technology, engineering, and mathematics, with 
no restriction of grade levels. As a result, it included a large 
portion of studies on the use of ITS in college students’ learning. 
The present meta-analysis focuses on the effectiveness of ITS on 
K-12 students’ mathematical learning. Second, the two reviews 
had different methodological standards and applied different study 
inclusion criteria. VanLehn (2011) covered experiments that ma- 
nipulated ITS interventions while controlling for other influences 
and excluded studies in which the experimental and comparison 
groups received different learning content. For example, it ex- 
cluded all studies of Carnegie Learning’s Cognitive Tutor because 
students in the experimental groups used a different textbook and 
classroom activities than did those in the comparison groups. In 
contrast, in the present meta-analysis, we placed no such restric- 
tions. In fact, our meta-analysis includes studies that compared two 
ecologically valid conditions in which ITS may or may not be the 
only difference between the conditions. As a result, 20 out of the 
31 independent studies included in this meta-analysis are studies of 
Cognitive Tutor. Last, VanLehn (2011) selected the outcome with 
the largest effect size in each primary study. The present meta- 
analysis extracted effect sizes for all the outcomes possible in each 
study and averaged them. Taken together, the differences mentioned
above may help explain the seemingly discrepant findings from
these two reviews. An overarching message from this is that when 
addressing the effectiveness of ITS, as is the case with many other 
educational interventions, one ought to ask a few questions: for 
whom, compared with what, and in what circumstances? 
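
The consequence of the last difference, selecting the largest outcome per study versus averaging all outcomes per study, can be sketched in a few lines of Python. The study names and effect sizes below are hypothetical, and the across-study summary is a simple unweighted mean chosen for brevity; neither review is known to have computed its summary this way.

```python
from statistics import mean

# Hypothetical effect sizes for several outcomes measured within each study.
study_outcomes = {
    "study_a": [0.10, 0.45, 0.20],
    "study_b": [0.05, 0.30],
    "study_c": [-0.10, 0.15, 0.60],
}


def summarize(outcomes_by_study, pick):
    """Reduce each study to one effect size with `pick`, then average across studies."""
    per_study = [pick(values) for values in outcomes_by_study.values()]
    return mean(per_study)


largest = summarize(study_outcomes, max)    # keep the largest outcome per study
averaged = summarize(study_outcomes, mean)  # average all outcomes per study

print(f"largest outcome per study: {largest:.2f}")
print(f"all outcomes averaged:     {averaged:.2f}")
```

Selecting the maximum can never yield a smaller per-study value than averaging, so the first rule gives a summary at least as large as the second whenever both are applied to the same studies.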


We compared the findings of the current meta-analysis with 
those of four recent reviews that focused on the effectiveness of 
computer technology or educational software on Pre-K to 12th 
graders’ mathematical achievement. Methodologically, these re- 
views are also largely comparable with the current meta-analysis. 
In general, compared with the findings of some similar meta- 
analyses or systematic reviews of the effectiveness of educational 
technology, the effects of ITS appear to be relatively small. 

Kulik (2003) reviewed 36 controlled evaluation studies to ex- 
amine the effects of using instructional technology on mathematics 
and science learning in elementary and secondary schools. He 
found that the median effect of integrated learning systems was to 
increase mathematics test scores by 0.38 standard deviations, or
from the 50th to the 65th percentile. He also found that the
median effect of computer tutorials was to raise student achievement
scores by 0.59 standard deviations, or from the 50th to the
72nd percentile.
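
The percentile figures follow from the standard normal distribution: a student at the 50th percentile who gains d standard deviations moves to the percentile given by the normal cumulative distribution function evaluated at d. A minimal sketch, assuming normally distributed scores (the function name is ours, for illustration only):

```python
from statistics import NormalDist


def percentile_after_gain(effect_size_sd):
    """Percentile reached by a median student who gains `effect_size_sd` SDs,
    assuming scores follow a normal distribution."""
    return 100.0 * NormalDist().cdf(effect_size_sd)


for d in (0.38, 0.59):
    print(f"d = {d:.2f} -> percentile {percentile_after_gain(d):.0f}")
```

The same conversion can be applied to the other effect sizes reported in this section to express them on a percentile scale.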

Murphy et al. (2002) reviewed 13 studies of the efficacy of 
discrete educational software on Pre-K to 12th grade students’ 
math achievement. They found that the overall weighted mean 
effect size for discrete educational software applications in math 
instruction was 0.45, and the median effect size was 0.27. On the 
basis of the distribution of confidence intervals, they concluded 
that d = 0.30 or greater appeared to be a reasonable estimate for 
the effectiveness of discrete educational software on mathematics 
achievement. 

Slavin and Lake (2008) reviewed 38 studies to investigate the 
effects of CAI on elementary mathematics achievement. They 
found that the median effect size was 0.19. In their review of 
middle and high school math programs, Slavin, Lake, and Groff 
(2009) found that the weighted mean effect size was 0.10 for the 
effectiveness of CAI. 

We should be quick to point out that conclusions based on the 
comparisons of the findings from different reviews ought to be 
tentative because there were variations among the reviews regarding
the types of educational technology. As the use of educational
technology has become such a common practice in teaching and learning,
it is increasingly difficult to picture a matrix of existing and
ever-changing educational technology. As described in the introduction
and Method section, we defined ITS as self-paced,
learner-led, highly adaptive, and interactive learning environments 
operated through computers. In the studies included in this meta- 
analysis, ITS delivered learning content to students, tracked and 
adapted to students’ learning paces, assessed learning progress, 
and gave students feedback. We believe these features distinguish 
ITS from other educational technologies in previous reviews. 

One possible explanation for the small effects revealed in this 
meta-analysis is related to the degree of technology implementa- 
tion and the purposes of technology use in classrooms. Evidence 
suggests that computer technology appears to have stronger effects
when used as a supplemental tool than when used as the only
or main form of instruction. For example, Schmid et al. (2009) found that
in terms of degree of technology use, low (g = 0.33) and medium 
use of technology (g = 0.29) produced significantly higher effects 
than did high use (g = 0.14). They also found that in terms of type 
of technology use, when used as cognitive support (e.g., simula- 
tions), educational technology produced better results (g = 0.40) 
than when it was used as a presentational tool (g = 0.10) or for 
multiple uses (g = 0.29). Also, Tamim, Bernard, Borokhovski, 
Abrami, and Schmid (2011) found that computer technology pro-
duced a slightly but significantly higher average effect size when 
used as supporting instruction than when it was used for direct 
instruction. Taken together, these findings imply that computer 
technology’s major strengths may lie in supporting teaching and 
learning rather than substituting or replacing main instructional 
approaches or acting as a stand-alone tool for delivering content. 
However, further research is needed to reach a firm conclusion. 
Schmid et al. (2009) argued that future research ought to move 
away from the “yes-or-no” question and move to other issues, such 
as how much technology is desirable for improving student learn- 
ing and how to best use technology to promote educational out- 
comes. 

In addition, much research has supported the view that educa- 
tional technology can improve student motivation and therefore 
positively influence student academic performance (Beeland, 
2002; Roblyer & Doering, 2010; U.S. Department of Education, 
1995). We speculate that as educational technology has become 
such a common part of learning environments in today’s educa- 
tional settings, student motivation and novelty effects related to the 
access of educational technology might have decreased. As a 
result, the relative effectiveness of educational technology may be 
declining. 

Last, findings of this meta-analysis need to be interpreted with 
caution. As we described earlier, the results were largely based on 
the 31 independent studies that compared learning outcomes of
instruction that included an ITS component with those of instruction without one. This
broad comparison covered four types of categorized situations that 
were different from one another to varying degrees. For studies in 
which ITS were the only difference between the treatment and 
comparison conditions, it is reasonable to conclude that the de- 
tected effectiveness difference can be attributed to ITS. However, 
when ITS were not the only difference between the conditions, 
differences in outcome measures cannot be attributed solely to 
ITS. For example, for studies of Cognitive Tutors, the treatment 
and comparison conditions could differ not only in the use of 
Cognitive Tutors but also in teachers or school environments. In 
such cases, it cannot be ruled out that the effectiveness of ITS is 
masked by the relative ineffectiveness of the other intervention 
components, such as teachers or school environments. It also could 
be the case that the effectiveness of the other intervention compo- 
nents is masked by the relative ineffectiveness of ITS. Taken 
together, this meta-analysis provides information regarding 
whether and how students’ learning outcomes might differ depending
on the involvement or absence of ITS from the instruction.
However, one ought to be aware that the effectiveness differences 
may or may not be attributed solely to ITS. 


Findings From Testing for Moderators 


As summarized earlier, two robust findings stand out from this 
meta-analysis. The first finding was that the effects of ITS ap- 
peared to be greater when the intervention lasted for less than a 
school year than when it lasted for one school year or longer. We 
offer three possible explanations for this finding. First, it might be 
that the novelty factor wears off and students’ motivation declines. 
This explanation would suggest that, just as is the case for many 
other interventions, too much of a good thing is not a good thing. 
Again, this brings us back to the important issue regarding how 


much technology is desirable and how to best use technology to
improve student learning. Second, when ITS were in regular or
long-term use in schools, researchers usually had no or very little 
involvement in the actual use of ITS during the study. In other 
words, the degree of implementation might have impacted the 
effectiveness of ITS. 

Third, some differences in the durations of interventions might 
be responsible for the differential effectiveness of ITS. Specifi- 
cally, we found that a number of major characteristics of the 
studies in which ITS lasted for one school year or longer might 
account for the small effect sizes yielded in the studies. These 
studies, such as the studies of Cognitive Tutor (e.g., Campuzano et 
al., 2009; Dynarski et al., 2007), were more often based on big 
national samples; used more rigorous study methods, such as 
random assignment; and used more distal outcome measures, such 
as standardized achievement tests. In contrast, studies in which 
ITS lasted for a short time or one semester generally produced
bigger effect sizes for a number of reasons. For example, they 
studied relatively less known ITS, they were more often based on 
small sample sizes, they used less rigorous study methods, and 
they often used specifically designed or nonstandardized outcome 
measures. Many previous meta-analyses have concluded that the 
study differences mentioned above have an impact on the magni- 
tude of effect sizes. Moderator analyses of this meta-analysis 
confirm this conclusion. We need to use caution in applying this 
finding to practices before further research is conducted and a 
firmer conclusion is reached. 

The second finding was that ITS helped general students learn 
mathematical subjects more than it helped low achievers. One 
possible explanation is that ITS may function best when students 
have sufficient prior knowledge, self-regulation skills, learning 
motivation, and familiarity with computers. It is possible that 
general students have more of the characteristics needed to navi- 
gate ITS than low achievers do. Therefore, they benefited more 
from using ITS. For low achievers, classroom teachers, rather than 
ITS, might be better leaders, motivators, and regulators to help 
them learn. Research has found that there are differences in the 
ways that high achievers and low achievers used ITS and other 
computer-based instruction tools (Hativa, 1988; Hativa & Shorer, 
1989; Wertheimer, 1990). For example, Hativa (1988) found that 
low achievers, more than high achievers, were prone to make 
software- and hardware-related errors when working with a CAI 
system. Hativa further concluded that it was possible that high achievers
were much more able than low achievers to adjust to a CAI
learning environment so that they were able to benefit more from 
it. 

This finding draws new attention to the debate regarding 
whether the use of computer technology actually widens the 
achievement gap between high achievers and low achievers, stu- 
dents of high and low learning aptitudes, students with advantaged 
and disadvantaged backgrounds, or White and minority students. 
The results from some longitudinal studies of CAI have provided 
support for the notion that computerized learning contributes to the 
increasing achievement gaps between students with different so- 
cioeconomic statuses, achievement levels, and aptitudes (Hativa, 
1994; Hativa & Becker, 1994; Hativa & Shorer, 1989). Ceci and 
Papierno (2005) noted that nontargeted technology intervention that
is used differently by advantaged and disadvantaged groups of 
students leads to achievement gap widening. This meta-analysis 
adds further support to the above conclusions with the evidence
that ITS might have contributed to the achievement gap between 
higher and lower achieving students. It is worth noting that only 
three studies provided results for low achievers. 

This issue merits considerable attention. As mentioned earlier, 
the motivation for ITS development is to help students achieve
learning gains like those they achieve with the help of expert human tutors.
There has been the expectation that ITS, as a form of advanced 
learning technology, ought to be able to provide optimal conditions 
needed to teach all children, given their interactivity, adaptability, 
and ability to provide immediate feedback and reinforcement. 
Developers of ITS may also want to consider ways to adapt ITS for
students with a variety of aptitudes and design culturally relevant 
technology learning environments. Further research with more 
nuanced approaches for ITS evaluation is needed to provide more 
information for this issue. 


Conclusion 


This meta-analysis synthesized studies of the relative effective- 
ness of ITS compared with regular class instruction on K-12 
students’ mathematical learning. Findings suggest that overall, ITS 
appeared to have no negative and perhaps a small positive impact 
on K-12 students’ mathematical learning. The main contributions 
of this meta-analysis lie on three fronts. First, it provided further 
evidence for the conclusions that educational technology might be 
best used to support teaching and learning. Second, this meta- 
analysis revealed that ITS appeared to have a greater positive 
impact on general students than on low achievers. This finding will 
likely draw considerable attention in policy debates on the issue of 
whether computerized learning might contribute to the achieve- 
ment gap between students with different achievement levels or 
prior backgrounds. Meanwhile, this finding implies that ITS re- 
search might be helpful in gaining a better understanding of how 
better learners learn through ITS, especially in terms of cognitive 
and metacognitive factors. Third, findings of this meta-analysis 
confirm several conclusions from many previous meta-analyses 
concerning the association between methodological features (e.g., 
sample size, research design, and outcome measure) of primary 
research and the effectiveness of the intervention studied. On the 
basis of the findings of this meta-analysis and similar reviews of 
educational technology, it seems best to think of ITS as one option 
in the array of education resources that educators and students can 
use to support teaching and learning. For students who are moti- 
vated and can self-regulate learning, ITS might be effective sup- 
plements to regular class instruction. However, ITS may not be 
efficient tools to boost low achievers’ or at-risk students’ achieve- 
ment. 


Using Student Interactions to Foster Rule-Diagram Mapping During
Problem Solving in an Intelligent Tutoring System


Kirsten R. Butcher Vincent Aleven 


University of Utah 


In many domains, problem solving involves the application of general domain principles to specific problem 
representations. In 3 classroom studies with an intelligent tutoring system, we examined the impact of 
(learner-generated) interactions and (tutor-provided) visual cues designed to facilitate rule-diagram mapping 
(where students connect domain knowledge to problem diagrams), with the goal of promoting students’ 
understanding of domain principles. Understanding was not supported when students failed to form a visual 
representation of rule-diagram mappings (Experiment 1); student interactions with diagrams promoted
understanding of domain principles, but providing visual representations of rule-diagram mappings negated
the benefits of interaction (Experiment 2). However, scaffolding student generation of rule-diagram mappings
via diagram highlighting supported better understanding of domain rules that manifested at delayed
testing, even when students already interacted with problem diagrams (Experiment 3). This work extends the 
literature on learning technologies, generative processing, and desirable difficulties by demonstrating the 
potential of visually based interaction techniques implemented during problem solving to have long-term 
impact on the type of knowledge that students develop during intelligent tutoring.


Keywords: problem solving, intelligent tutoring, diagrams, visual interaction 


In domains such as geometry, chemistry, and physics, problem 
solving typically requires learners to move fluidly between 
problem-specific representations and domain-general principles 
(e.g., geometry theorems or physics principles) that govern 
problem-solving strategies and solutions. Early research on exper- 
tise has established that understanding how domain principles 
relate to problem-specific features is a key component of expert 
knowledge (Chi, Feltovich, & Glaser, 1981). However, novice 
students struggle to apply appropriate domain principles across a 
variety of individual problems and often are distracted by super- 
ficial aspects of problem representations (Lovett & Anderson, 
1994; Ross, 1989). Even when provided with worked examples 
that demonstrate a step-by-step model of an expert solution, most 
students spontaneously engage only in superficial processing or 
passive examination of these examples (Atkinson & Renkl, 2007).

In geometry and other STEM (science, technology, engineering,
and mathematics) domains, visual representations are a key aspect 
of specific problem situations. For example, in chemistry, visual 
representations (e.g., ball-and-stick diagrams, Lewis structure di- 
agrams, symbolic structural formulae) are used to depict the po- 
sition of atoms in a molecule as well as the bonds between them. 
In physics, diagrams often are used to depict the physical situation 
represented in a problem as well as the forces operating on the 
problem situation. In geometry, a diagram is used to represent the 
geometric context of a specific problem. Geometry diagrams de- 
pict the geometrical relationships of visual elements (e.g., lines, 
rays) that include the given information (e.g., known angles) 
necessary to solve a problem. In each of these domains, effective 
problem solving requires the learner to connect relevant domain 
principles to key aspects of the visual representation(s) depicting 
the specific, to-be-solved problem. Domain-level principles deter- 
mine what aspects of the visual representation are relevant, and 
govern the problem-solving strategies that can be applied to the 
representation for a correct solution. 

Let us consider an example from geometry. Domain-level prin- 
ciples (more specifically, geometry postulates and theorems) drive 
the calculation of to-be-solved numerical values based upon the 
geometric relationships provided in the problem diagram. For 
example, the linear pair postulate can be applied when two adja- 
cent angles sit on a single line and share a common side. Figure 1 
depicts a situation in which Angle 1 and Angle 2 are a linear pair.
In a linear pair, if the measure of Angle 1 is known, Angle 2 can
be solved by subtracting the measure of Angle 1 from 180° (since
a straight line measures 180°).
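
The arithmetic behind the linear pair postulate can be written as a one-line function. This is only an illustrative sketch; the function name and the input check are our own and are not part of the tutoring system described in this article.

```python
def linear_pair_partner(angle_deg):
    """Return the measure of the other angle in a linear pair, in degrees."""
    if not 0 < angle_deg < 180:
        raise ValueError("an angle in a linear pair must be between 0 and 180 degrees")
    # The two angles of a linear pair are supplementary: they sum to 180 degrees.
    return 180 - angle_deg


print(linear_pair_partner(65))  # Angle 1 = 65 degrees, so Angle 2 = 115 degrees
```

Note that the function encodes only the calculation. Deciding whether two angles in a diagram actually form a linear pair, and therefore whether the rule applies at all, is the mapping step that the calculation takes for granted.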

Figure 1. An example of a linear pair of angles depicted diagrammatically.

In this article, we use the term rule to refer to domain-level
propositions that govern the selection and application of a correct
step during problem solving. Thus, we define the process of
connecting domain-level rules (in this case, geometry postulates
and theorems) to specific features of problem diagrams as rule-diagram
mapping. Rule-diagram mapping represents a specific
case of connecting domain-level principles to problem features.
However, it should be noted that use of the term rule in this sense 
is not derived from the formal language of geometry; all geometry 
knowledge takes the form of definitions, postulates/axioms, or 
theorems. Postulates and axioms are statements about basic rela- 
tionships or ideas in geometry that are accepted (or assumed) to be 
true (e.g., “Given any two points, there exists a line between them.”), 
and theorems are statements that have been proven to be true on the 
basis of definitions, postulates/axioms, or previously proven theorems 
(e.g., “Vertical angles are congruent.”). The extent to which students
understand or use these formal terms is not clear; although
many instructional materials in geometry explain the terms postulate 
and theorem, some materials use informal language that may blur 
these distinctions. For example, some texts refer to the triangle in- 
equality theorem as a rule (Carnegie Learning, 2007) or as a principle 
(Ryan, 2011). Across other texts, there are inconsistencies about what 
is named as a postulate or theorem. For example, a linear pair of 
angles is supplementary: Some texts call this the linear pair postulate 
(e.g., Carnegie Learning, 2010), and others call it the linear pair 
theorem (e.g., Carter, Cuevas, Day, Malloy, & Cummins, 2012). For 
simplicity’s sake, we refer to all domain-level reasons that students 
can use to justify geometry problem-solving steps as rules. 


Strategies for Connecting Domain Principles and 
Specific Problem Features 


Self-Explanation 


Self-explanation is a robust learning strategy in which learners 
attempt to explain the content of learning materials to themselves, 
focusing on the meaning and importance of the instructional con- 
tent as well as its connections to prior knowledge (e.g., Chi, 
Bassok, Lewis, Reimann, & Glaser, 1989; Chi, de Leeuw, Chiu, & 
LaVancher, 1994; Renkl, Stark, Gruber, & Mandl, 1998). Chi et al. 
(1989) found that only about one quarter of students’ spontane- 
ously produced self-explanations connected solution steps of phys- 
ics problems to domain-level principles. Despite the relative rarity 
of their occurrence, the principle-based self-explanations that were 
generated served to enhance good solvers’ understanding of do- 
main principles. Additional research has further distinguished be- 
tween higher and lower quality self-explanations. Renkl (1997) 


confirmed that—without specific prompting—few students go be- 
yond passive or superficial explanation of examples. However, 
Renkl also noted that successful explainers tended to fall into two 
categories, one of which he termed principle-based explainers. 
These principle-based explainers focused on analyzing the sub- 
goals of the example solution and elaborating on the principles 
related to those subgoals. Thus, these successful explainers con- 
nected domain principles to the specific steps of the example 
solution. 

Although initial work on self-explanation focused on spontane- 
ous processing, further research established that prompting self- 
explanation led to increases in understanding and more accurate 
mental models than in a control condition (Chi et al., 1994). 
Despite its potential support for learning, prompting the generation 
of high-quality self-explanations for large numbers of learners is a 
significant challenge. In computer environments, Hausmann and 
Chi (2002) found that typed self-explanations could be prompted 
as students worked with instructional materials on a computer, but 
these free-form, typed self-explanations were largely paraphrased 
statements that lacked quality. More promising results have been 
obtained with computer-supported interactions that structure the 
content of students’ self-explanations. Conati and VanLehn (2000) 
showed the effectiveness of user-adapted support for self- 
explanation in an intelligent tutoring system (ITS) for physics. 
Using prompts and drop-down menus, their system elicited self- 
explanations that connected problem-solving steps to domain prin- 
ciples and abstract solution plans. Using an ITS for geometry, 
Aleven and Koedinger (2002) showed that prompting students to 
name the high-level geometry principle (such as “corresponding 
angles”) that justified each problem-solving step improved the 
depth of student learning. Notably, when Aleven and Koedinger 
controlled for time on task (Experiment 2), benefits were limited to 
items requiring understanding of geometry rules (i.e., identifying
unsolvable problems and naming geometry rules) rather than to the 
overall accuracy of numerical solutions (for which shallow 
problem-solving strategies can be more successful). In a system for 
learning probability, Atkinson, Renkl, and Merrill (2003) found 
support for near and far transfer when students were required to 
select (from a multiple-choice set of alternatives) the probability 
principle that justified each solution step (e.g., “multiplication 
principle”) in worked examples. Thus, scaffolding simple (ver- 
bally based) statements of problem-solving principles appears to 
be able to guide students toward deeper understanding of domain 
concepts. 


Focusing Learner Processing on Central Problem 
Aspects 


Naming the problem-solving principle associated with each step 
of a worked example or problem solution may be considered a 
fairly impoverished form of self-explanation compared to self- 
explanations uttered during natural, spontaneously generated 
speech. However, these self-explanation prompts may be effective 
because they focus student activity on key connections between 
the problem at hand and important domain concepts. Schworm and 
Renkl (2007) studied two types of self-explanation prompts in a 
system designed to train argumentation skills. In this research, the 
domain was “argumentation,” and domain-level principles (i.e.,
rules of argumentation) can be applied to specific content areas 
(e.g., political topics). Schworm and Renkl (2007) developed
self-explanation prompts that targeted either the specific content of 
the topic being argued (e.g., stem-cell research) or a domain-level 
principle of argumentation (e.g., provide an alternative theory). 
Only self-explanations related to the overarching domain princi- 
ples promoted increased learning—self-explanations that focused 
on the specific content of the argument were ineffective. These 
findings are consistent with the general prescription that interac- 
tive features designed to support self-explanation in computer 
environments should target active processing of key conceptual or 
structural aspects of the to-be-learned domain (Atkinson & Renkl, 
2007). This research also highlights the importance of thinking 
beyond specific problem instantiations to make connections to 
broader domain principles. 

In domains where visual and verbal representations are central to 
problem solving, students’ inabilities to understand the ways that 
domain principles relate to specific visual features of problems can 
result in misdirected attention and compromised learning. For exam- 
ple, Kozma (2003) found that chemistry students primarily focused on 
the surface features of representations and failed to connect visual 
features to underlying chemical principles. In contrast, chemistry 
experts utilized multiple representations (including diagrams, tables, 
and symbols) and made explicit connections between structural as- 
pects of the representations and larger conceptual issues. Similarly, 
recent research has found that students perform well on chemistry 
questions that can be answered from visual information alone but 
poorly on questions that require the integration of information across 
representations such as a model and a graph (Stieff, Hegarty, & 
Deslongchamps, 2011). In physics, Wilkin (1997) found that 
problem-solving diagrams can decrease the effectiveness of self- 
explaining because students often use adjacency in diagrams to draw 
erroneous inferences that fail to be revised during learning. In geom- 
etry, novices also tend to focus on the surface-level similarities of 
diagrams. Lovett and Anderson (1994) found that students errone- 
ously attempted to apply the same solution steps to geometry prob- 
lems when diagrams looked similar but drew upon different logical 
structure. Lovett and Anderson concluded that in geometry—and 
other domains where diagrams are central to problem solving—the 
diagram serves as the basis for student recall. Thus, novice learners 
likely need support in understanding how key visual elements are tied 
to larger domain principles. 

Despite students’ apparent need for guidance in connecting visual 
representations to domain principles, self-explanations enacted in 
learning technologies—as discussed above—largely have taken the 
form of verbal statements that offer only weak support for visual-verbal
connections between individual problems and domain principles.
Aleven and Koedinger (2002) noted that students who were
prompted to name the geometry principle related to each problem- 
solving step in an ITS were more likely to perform well on hard-to- 
guess problems, which provided indirect evidence that they had 
developed more integrated visual-verbal knowledge during practice. 
However, these students showed much room for improvement. A key 
question is whether an ITS can improve student learning by helping 
students connect relevant visual and verbal elements. More specifi- 
cally, can student learning in geometry be facilitated by an ITS that 
uses visually based interactive elements to connect (verbally ex- 
pressed) domain principles and (visually represented) diagram fea- 
tures during problem solving? 


Mapping Domain Principles to Problem Diagrams in 
Geometry 


Ideally, self-explanations related to problem-solving diagrams 
should facilitate attention to or use of visual representations in 
ways that mimic expert processes, just as worked examples facil- 
itate learning by prompting students to self-explain expert solution 
steps for a given problem (Atkinson, Derry, Renkl, & Wortham, 
2000). Thus, it is important to consider how experts use geometry 
diagrams during problem solving. Koedinger and Anderson (1990) 
conducted research with experts solving geometry problems and, 
based upon findings from verbal data, developed a model of expert 
problem solving in geometry (the diagram configuration, or DC, 
model). Experts processed diagrams by identifying key configurations
that were used to retrieve corresponding schematic knowledge.
In the DC model, this was instantiated by parsing diagrams
to identify key configurations (e.g., two parallel lines intersected 
by a transversal); the model used these configurations as well as 
given information about the problem to retrieve relevant schemas. 
Koedinger and Anderson’s analysis of the DC model showed that 
it modeled expert processes well. 
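
The retrieval step described above can be illustrated with a toy sketch. This is not Koedinger and Anderson's implementation: the `Schema` class, the `SCHEMAS` table, the configuration labels, and the `retrieve_schemas` function are all hypothetical names chosen only to show how recognized configurations, together with given information, could index candidate schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    name: str            # geometry rule the schema encodes
    configuration: str   # diagram configuration that indexes the schema
    needs: frozenset     # givens that must hold before the rule applies


# Hypothetical schema table; a real model would hold many more entries.
SCHEMAS = [
    Schema("corresponding angles", "parallel lines cut by transversal",
           frozenset({"lines are parallel"})),
    Schema("alternate interior angles", "parallel lines cut by transversal",
           frozenset({"lines are parallel"})),
    Schema("base angles of isosceles triangle", "isosceles triangle",
           frozenset({"two sides congruent"})),
]


def retrieve_schemas(configurations, givens):
    """Return names of schemas whose configuration was found in the diagram
    and whose required givens are all present."""
    givens = set(givens)
    return [s.name for s in SCHEMAS
            if s.configuration in configurations and s.needs <= givens]


found = retrieve_schemas({"parallel lines cut by transversal"},
                         {"lines are parallel"})
print(found)
```

In this sketch a configuration alone is not enough: a schema is returned only when the problem's givens also license it, which is the distinction novices tend to miss when they rely on surface appearance.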

Getting students to recognize key configurations is particularly 
problematic in geometry, where problem-specific diagrams may 
vary widely in appearance even when the same key configurations 
are present. Moreover, superficial similarities in problem diagrams 
may mislead students into retrieving and applying similar geom- 
etry rules even when the problems contain a different underlying 
structure (Lovett & Anderson, 1994). Consider the examples in 
Figure 2. When two parallel lines are cut by a transversal, angles 
on the same side of the transversal and in the same relative position 
to the parallel lines are corresponding angles with congruent 
measures. Figure 2a shows two diagrams where the marked angles 
are corresponding. Figure 2b shows diagrams that are similar in
appearance, but the marked angles are not corresponding angles. If
students understand the connection between the rule and the diagram
in a vague or shallow way (i.e., in a square, angles that look
similar are corresponding), they may incorrectly use corresponding
angles as the (incorrect) justification to (correctly) answer that
the angles in 2b are equal. Thus, they may get the numerical
solution correct without understanding the underlying geometry
rationale for their answers.

Figure 2. Diagrams with similar appearances: Angles 1 and 2 are corresponding angles in 2a, but not in 2b.

Developing connections between geometry rules and diagram 
configurations during problem solving is complicated by the fact 
that many geometry rules result in students using the same equa- 
tion to generate a numerical solution, but for different logical 
reasons. For example, consider a problem where Angle X is given 
and the student’s goal is to solve for Angle Y. If the angles are base 
angles of an isosceles triangle, Angle X = Angle Y. However, the 
same equation also will produce a correct answer if the angles are 
corresponding angles, alternate interior angles, or a pair of bisected 
angles. Thus, even in an ITS, students often develop shallow 
problem-solving strategies that result, behaviorally, in the obser- 
vation that students are better at finding numerical answers to 
problems than explaining the reasons driving these answers 
(Aleven, Koedinger, Sinclair, & Snyder, 1998). The challenge is 
for students to move beyond these shallow (but often successful) 
strategies to develop an understanding of when and how to apply 
specific geometry rules to diagrams across a variety of problems. 
The purpose of this work is to explore interactive elements that 
support development of this rule—diagram mapping. 
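
The point that one equation serves several rules can be made concrete with a small sketch. It assumes a hypothetical scoring scheme in which the numerical answer and the stated rule are checked separately; the function names and the rule list are illustrative and are not taken from the tutor.

```python
# Rules named in the text that all reduce to "Angle Y = Angle X".
EQUAL_ANGLE_RULES = [
    "base angles of an isosceles triangle",
    "corresponding angles",
    "alternate interior angles",
    "bisected angles",
]


def solve_angle_y(angle_x, rule):
    """Return Angle Y for any rule whose equation is Angle Y = Angle X."""
    if rule not in EQUAL_ANGLE_RULES:
        raise ValueError(f"not an equal-angle rule: {rule}")
    return angle_x


def score_step(answer, stated_rule, correct_value, correct_rule):
    """Score the numerical answer and the justification separately."""
    return {
        "numeric_correct": answer == correct_value,
        "rule_correct": stated_rule == correct_rule,
    }


# A student solves a corresponding-angles step but cites the wrong rule.
answer = solve_angle_y(60, "alternate interior angles")
result = score_step(answer, "alternate interior angles",
                    60, "corresponding angles")
print(result)
```

Because every rule in the list yields the same number, a numerical check alone cannot tell a sound justification from a shallow one; only the separate rule check exposes the difference.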


Facilitating Coordination of Visual-Verbal Information 


A variety of methods have been used to try to facilitate connec- 
tions between visual representations during problem solving and 
fundamental domain concepts or principles. Many of these meth- 
ods can be grouped into two major categories: materials that 
physically coordinate visual and verbal information and materials 
that cue the learner to make connections between visual and verbal 
information sources. 

Physical coordination. There is a great deal of research that 
has demonstrated that the ways in which visual and verbal infor- 
mation is combined have impacts on learning (e.g., Bodemer, 
Ploetzner, Feuerlein, & Spada, 2004; Glenberg & McDaniel, 1992; 
Mayer & Anderson, 1992; Moreno & Mayer, 1999; Tabbers, 
Martens, & van Merriënboer, 2004). Results have demonstrated
that spatial contiguity (Mayer, 2001) between visual and verbal 
information reduces cognitive load by removing the effort associ- 
ated with spatially mapping between information sources. How- 
ever, it is possible to develop situations where requiring students to 
map between visual and verbal information sources can improve 
learning. Research has found that requiring students to actively 
integrate split source materials (e.g., by dragging text labels into a 
visual diagram) improves learning more than does providing learners
with pre-integrated representations, especially when the materials
are complex (Bodemer, Ploetzner, Bruchmüller, & Hacker,
2005). Further, recent research shows that learners who are pro- 
vided with both concrete and abstract diagrams can transfer their 
knowledge better than learners provided with a single representa- 
tion, a finding likely due to the implicit support that multiple 
representations provide for making connections between existing 
knowledge, current learning materials, and larger domain concepts
(Moreno, Ozogul, & Reisslein, 2011). Thus, instructional materials
may support learning if interactions are used to facilitate mapping 
that connects relevant aspects of representations to larger domain 
concepts. 

Visual cuing. Visual cues that focus learner attention on rel- 
evant features of representations during learning have been found 
to be quite powerful in supporting learning with multimedia ma- 
terials. Eye-tracking evidence has shown that attending to impor- 
tant problem features can facilitate problem solving (Grant & 
Spivey, 2003), even when learners are not aware of their atten- 
tional focus (Thomas & Lleras, 2007). Multimedia research has 
found learning benefits when presentations direct learners’ atten- 
tion to relevant content by spotlighting visual features as they are 
discussed in an audio presentation (de Koning, Tabbers, Rikers, & 
Paas, 2007). In fact, de Koning and colleagues (de Koning, Tab- 
bers, Rikers, & Paas, 2010) found no differences in learning 
resulting from learner-generated explanations and from provided 
instructional explanations when visual cues guided attentional 
focus during study. According to de Koning et al. (2010), visual 
cues may serve to increase active processing of instructional materials 
across a variety of explanation conditions. Since visual cues provide 
support in attending to central aspects of the learning materials, it is 
sensible to conclude that the combination of visual cues and expla- 
nation is effective because it concentrates processing on the most 
important aspects of the learning materials. However, it is an open 
question as to whether visual cues can, themselves, serve to facilitate 
effective processing and whether or not learner generation of such 
cues can enhance understanding. 


The Current Research 


Informed by the existing research outlined above, we address 
two key questions regarding computer-supported understanding of 
rule—diagram mapping. The first key question is: Can computer- 
supported interactions focused on key features of the visual dia- 
gram improve students’ abilities to apply domain-general geome- 
try “rules” to specific problem diagrams? If, as proposed by Lovett 
and Anderson (1994), diagrams serve as the basis for student 
recall, focusing student processing on key configurations of visual 
diagrams during problem solving may support understanding and 
long-term recall of rule—diagram connections. In this research, we 
explored two forms of visually targeted interactions: (a) self- 
explanations that were targeted to diagram features and (b) on- 
demand help that provided visual cues for rule—diagram mapping. 

The second key question is: Who should generate the rule— 
diagram mappings, the tutor or the student? In an ITS (and in other 
forms of instruction), information that is central to the learning
task can be either provided or withheld by the tutor (Koedinger & 
Aleven, 2007). For example, an ITS can highlight relevant infor- 
mation in a diagram, or it can require the student to highlight key 
visual features. Although overt activities that require the student to 
generate new information or representations should promote learn- 
ing (Chi, 2009), a crucial factor is how successful the student will 
be in generating the targeted information without excessive floun- 
dering (Koedinger & Aleven, 2007). Thus, specific scaffolding 
may be required to structure the generation process and support 
students’ development of meaningful representations. Another key 
consideration is how effective the generative activities will be in 
helping students process the connections between specific problem
features and domain-level principles (in this case, between diagram
features and geometry rules).

We conducted a series of three experiments that explored the 
impact of different forms of rule—diagram mapping (verbal expla- 
nations or visual representations) as well as the instructional 
source of such mappings (student-generated or tutor-provided) on 
students’ problem-solving success and learning of geometry rules. 
The first experiment focused on an explanation-based approach 
that required students to articulate specific problem features in- 
volved in the application of domain principles; students generated 
a written mapping of diagram features to geometry rules. The 
second experiment examined an alternate approach to identifying 
problem features involved in the application of domain principles; 
in this study, on-demand help provided a visual mapping between 
diagram features and geometry rules. The third experiment exam- 
ined two different methods of visual mapping during problem 
solving: student-generated mappings versus tutor-provided mappings.


Experiment 1 


In this experiment, we examined the effects of a relatively 
simple way of having students self-explain the rule—diagram 
mapping for each step during problem solving. As in Aleven 
and Koedinger (2002), all students identified the geometry rule 
that justified each problem-solving step during intelligent tu- 
toring practice. However, a rule—diagram mapping factor was 
added to this self-explanation activity in which some students 
also went on to identify the specific diagram features that were 
relevant to the application of the named geometry rule. This 
experiment also varied the degree to which students’ attention 
was focused on visual features in the geometry diagram during 
problem solving, by varying whether students interacted with 
problem diagrams or a solutions table during tutoring practice. 


Method 


Participants. Participants were 96 students from six 10th 
grade geometry classes at a vocational school in rural Pennsylva- 
nia. All classes were taught by the same teacher. Within each class, 
students were randomly assigned to the four experimental condi- 
tions. 

Materials. 

Geometry Cognitive Tutor. As a platform for our research, 
we used the Geometry Cognitive Tutor, one of several existing 
Cognitive Tutors (e.g., Aleven & Koedinger, 2002; Anderson, 
Corbett, Koedinger, & Pelletier, 1995). Cognitive Tutors are a 
type of ITS based on the ACT-R theory of cognition and 
learning (Anderson & Lebiere, 1998); several studies have
found that Cognitive Tutors are very effective in supporting 
student learning (Anderson et al., 1995; Koedinger, Anderson, 
Hadley, & Mark, 1997). The Cognitive Tutor uses algorithms 
and cognitive models to track students’ skill development and 
to select practice problems for students. The Cognitive Tutor 
also uses a number of mechanisms to reduce cognitive load 
demands, including a step-by-step problem-solving sequence 
(where problem-solving subgoals are laid out for students) and 
immediate feedback at every step. None of these successful 
tutor features were manipulated in the current work. 


The Geometry Cognitive Tutor is part of a full-year “hybrid” 
course in geometry that includes a text, ancillary materials, train- 
ing for teachers, and the Cognitive Tutor software (Ritter, Ander- 
son, Koedinger, & Corbett, 2007). Before participating in this 
research, all students had been using the Geometry Cognitive 
Tutor as part of their classroom curriculum for several months and 
were familiar with its basic functions (e.g., how feedback is 
displayed). 

The research design for this experiment was a 2 × 2 factorial
design that varied the locus of interaction during problem solving 
(an interactive diagram vs. a solutions table) and the self- 
explanation of rule-diagram mappings (no mapping vs. rule— 
diagram mapping). We first describe the two levels of the locus of 
interaction factor, followed by the two levels of the mapping 
factor. 

Table interaction tutor. When the locus of interaction was the 
solutions table, all student interactions took place in a table sepa- 
rate from the geometry diagram (see Figure 3). Students entered 
answers and received tutor feedback in the table. As is typical in the
Geometry Cognitive Tutor, students needed to enter all values and 
rules correctly to complete a problem. Thus, students revised 
incorrect entries until correct. A static diagram (i.e., the diagram
did not change in any way and students could not interact with it) 
was provided for each problem. 

Diagram interaction tutor. When the locus of interaction was 
the diagram, students interacted directly with the geometry dia- 
gram as they worked in the Cognitive Tutor. The unknown (to- 
be-solved) quantities were represented in the diagram by question 
marks (see Figure 4). When students clicked a question mark, a 
small work area opened (co-located with the diagram) that allowed 
students to enter answers and receive feedback. As in the table 
interaction tutor, students needed to enter all values and rules 
correctly to complete a problem; students revised incorrect entries 
until correct. Correct numerical solutions were integrated directly 
into the diagram (see Figure 4). 

No mapping. In this level of the mapping factor, the second 
experimental factor we varied, students entered only the numerical 
solutions and geometry rules for each problem-solving step (see 
Figure 5). Rules were either manually typed or selected from the 
tutor glossary. 

Rule—diagram mapping. In the rule-diagram mapping con- 
dition, students were required to name the diagram features that 
were necessary to use the geometry rule that they had named for 
the problem-solving step. Necessary diagram features were 
defined as those that were used in the application of this rule. 
Because some postulates and theorems operate on multiple 
diagram features (e.g., the angle addition postulate requires the 
addition of two angles), the tutor scaffolded student answers by 
activating the number of “applied to” fields that corresponded 
to the number of arguments required in the selected rule. For 
example, in Figure 5, the central angle theorem requires only 
one argument (the known central angle or its intercepted arc), 
so only one “applied to” field is activated for completion. In 
order to expedite student identification of relevant diagram 
elements, students clicked on the quantities displayed in the 
diagram or the solutions table as a convenient (“one-click”)
shorthand for naming the diagram features corresponding to 
these values (e.g., an angle or an arc). The tutoring software 
displayed the name of the corresponding diagram feature (e.g.,
Arc EO) in a dedicated field in the open work area (or the
solutions table) called the “Applied To” field.

Figure 3. An annotated screenshot of the Cognitive Tutor where the table is the locus of interaction.

For example, as can be seen in Figure 5, the student has 
named the “Central Angle” theorem as the principle used to find 
the measure of Angle OTE. Since a central angle is equal to the 
measure of its intercepted arc, the student needs to indicate the 
intercepted arc. To do so, the student clicks the solved value of 
85.8 for the arc (either in the diagram or the table, depending 
upon condition), which enters “Arc EO” in the “Applied To” 
field. Students received immediate feedback on named diagram 
elements, as for all other submitted answers in the Cognitive 
Tutor. As with all Cognitive Tutor answers, students were 
required to revise any incorrect answers until all entries were 
completed correctly. 
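
The scaffolding described above, in which the number of active “Applied To” fields matches the number of arguments in the selected rule, can be sketched as follows. The `RULE_ARITY` dictionary and both function names are hypothetical stand-ins, and the order-insensitive comparison is an assumption; the source does not say how the tutor compares entries.

```python
# Number of diagram features each rule operates on. The two entries follow
# the examples in the text; the dictionary itself is a hypothetical stand-in.
RULE_ARITY = {
    "Central Angle": 1,    # the known central angle or its intercepted arc
    "Angle Addition": 2,   # the two angles being added
}


def activate_fields(rule):
    """Return one empty "Applied To" field per argument of the rule."""
    return [None] * RULE_ARITY[rule]


def check_mapping(entered, expected):
    """Assumed check: the named features must match, in any order."""
    return sorted(entered) == sorted(expected)


fields = activate_fields("Central Angle")
ok = check_mapping(["Arc EO"], ["Arc EO"])
print(len(fields), ok)
```

For the Figure 5 example, selecting the Central Angle theorem opens a single field, and clicking the solved arc value fills it with "Arc EO".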

Assessments: Pretest, immediate posttest, and delayed posttest. 
Student learning was assessed via three assessments: a pretest, 
a posttest, and a delayed posttest. The pretest and immediate 
posttest consisted of the same problems but in different orders 
to minimize superficial recognition; the delayed posttest was 
composed of new problems. The pre- and posttest contained 16 
problems (two problems for each of eight geometry diagrams). 
The delayed posttest contained eight problems (two problems 
for each of four geometry diagrams). Although the posttest and 
delayed posttest targeted geometry rules from the same unit of
study, the delayed posttest included somewhat less complex
diagrams with fewer embedded shapes (see Figure 6). All tests 
were administered individually by computer, using the tutoring 
software but without any tutoring (i.e., no feedback or hints); 
answers were recorded in software logs. 

Solvability decisions. For each problem, students first were 
required to make a solvability decision, that is, to determine if a 
learned geometry principle would allow them to solve the problem 
with the available information. This solvability decision required 
students to reason carefully about the diagram features needed to 
apply a geometry rule and to determine if one of these rules was 
relevant to the existing problem. Solvability judgments are chal- 
lenging in that they require students to consider all potentially 
relevant geometry rules and diagram relationships before answering
“no.” Students received 1 point for each correct solvability
decision (pre/posttest maximum = 16, delayed posttest
maximum = 8).

Numerical solutions. These items tested students’ abilities to 
generate the correct numerical solution for to-be-solved angles 
(e.g., Angle ABC = 60°). Students received 1 point for every
correctly solved item. (Due to the inclusion of solvability
decisions, not all items had numerical solutions: pre/posttest
maximum = 12; delayed posttest maximum = 5).

Figure 4. An annotated screenshot of the Cognitive Tutor where the diagram is the locus of interaction.


Rule application. For each problem, students were asked to 
name the geometry rule that they used to derive their numerical 
solution and to map the features of the geometry diagram to the 
geometry rule that they had named (e.g., corresponding angles: 
Angle ABC = Angle FGD). Students received 1 point for each
correctly identified rule and for each correct mapping to dia- 
gram features (pre/posttest maximum = 32; delayed posttest 
maximum = 16). 

Statistical analyses. For each of the three experiments re- 
ported here, a series of three analyses of variance (ANOVAs) 
was conducted to assess the impact of experimental interven- 
tions on different types of student knowledge. A separate anal- 
ysis was conducted for each of the three major types of assess- 
ment items: solvability decisions, numerical solutions, and rule 
application. In Experiment 1, a series of three repeated- 
measures ANOVAs were conducted where the independent 
variables were locus of interaction (table vs. diagram) and 
mapping (no mapping vs. rule—diagram mapping) and the re- 
peated factor was test time (immediate posttest, delayed post- 
test). A Bonferroni correction was used to adjust alpha levels 
for multiple comparisons; analyses of student outcomes used an 
alpha level of p = (.05/3) = .017. 
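
The adjusted alpha level can be reproduced with a few lines. This is a minimal sketch of the Bonferroni arithmetic stated above; the function names are illustrative.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test alpha after a Bonferroni correction."""
    return family_alpha / n_tests


def is_significant(p_value, alpha):
    """A result counts as significant only if p falls below the adjusted alpha."""
    return p_value < alpha


# Three outcome analyses: solvability, numerical solutions, rule application.
alpha = bonferroni_alpha(0.05, 3)
print(round(alpha, 3))
```

Under this threshold a result with p = .03, which would pass an uncorrected .05 criterion, is not treated as significant.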


Because we were most interested in whether students were 
able to apply geometry rules accurately during problem solving, 
we analyzed student outcomes based on the percent correct of 
attempted answers (much like tests of cognitive skills that 
assess performance based upon accuracy of attempted items; 
cf. Hegarty & Waller, 2004). Students can skip problems for
many reasons—they may have run out of time, missed an item 
as they worked, or determined that they didn’t know the answer. 
By analyzing percent correct of attempted items, we assessed 
how well students were able to apply their geometry knowledge 
when we knew that they attempted to do so. Percent correct of 
attempted responses was calculated by dividing the number of 
correct answers by the total number of attempted answers; 
tables of means and standard deviations also report raw perfor- 
mance rates (percent correct) and attempt rates (percent at- 
tempted) for each condition across the assessment types (see 
Table 1). 
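
The three rates reported in the tables can be computed as in the sketch below. The response encoding (True, False, or None for a skipped item) and the function name are assumptions made for illustration. Returning None when nothing was attempted mirrors the note to Table 1, which reports that students who attempted no numerical solutions were dropped from analyses using percent correct of attempted.

```python
def percent_scores(responses):
    """Compute the three rates for one student on one item type.

    responses: list of True (correct), False (incorrect), or None (skipped).
    """
    if not responses:
        raise ValueError("at least one item is required")
    total = len(responses)
    attempted = [r for r in responses if r is not None]
    correct = sum(1 for r in attempted if r)
    return {
        "pct_correct": 100 * correct / total,
        "pct_attempted": 100 * len(attempted) / total,
        # Undefined when nothing was attempted; such cases are excluded.
        "pct_correct_of_attempted": (
            100 * correct / len(attempted) if attempted else None
        ),
    }


scores = percent_scores([True, False, None, True])
print(scores)
```

In this example the student is correct on 2 of 4 items overall (50%) but on 2 of the 3 attempted items, so the accuracy measure is not lowered by the skipped item.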

Procedure. During the experiments described here, students 
participated in the study as part of their normal classroom activi- 
ties, using the Geometry Cognitive Tutor for one (75 min) class- 
room block each week. In the first week of the study, students 
spent up to 30 min completing the pretest. Students spent three
classroom blocks (one per week in Weeks 2–4 of the study)
working in the angles units of the Geometry Cognitive Tutor using
their randomly assigned condition. In Week 5, students took the
immediate posttest (30 min). One month following the posttest,
students completed the delayed posttest (20 min).

Figure 5. An annotated screenshot of the Cognitive Tutor’s rule-diagram mapping conditions. Annotations in the screenshot: No Mapping (students name only the geometry rule used in the problem-solving step) and Rule-Diagram Mapping (for each problem-solving step, students must name not only the geometry rule but also the diagram features used in the application of the rule).





Results 


We limited our analyses to the 53 students who completed both
the posttest and the delayed posttest. Sample size was comparable
across conditions, as can be seen in Table 1, which shows the
means and standard deviations for assessment data. Although high,
this rate of attrition is consistent with other studies conducted at
the school that have experienced over 60% attrition (Salden,
Aleven, Schwonke, & Renkl, 2010; Walker, Rummel, & Koedinger,
2009).¹

Figure 6. Example diagrams from a posttest item (6a) and a delayed posttest item (6b).

Table 1
Experiment 1 Means and (Standard Deviations) for Posttest and Delayed Posttest Assessment Items

                                          Table interaction                       Diagram interaction
Item type                                 No mapping          Mapping             No mapping          Mapping
Cognitive Tutor position prior to study   18.2 (3.6)          19.2 (3.5)          [illegible]         17.8 (4.8)
Posttest
  Solvability decisions                   (n = 10)            (n = 14)            (n = 15)            (n = 14)
    % correct                             [illegible]         57.1 (19.3)         63.0 (12.6)         61.3 (18.3)
    % attempted                           96.3 (11.9)         96.4 (11.7)         98.2 (6.7)          100 (0)
    % correct of attempted                53.9 (16.6)         59.5 (18.5)         64.1 (12.1)         61.3 (18.3)
  Numerical solutions*                    (n = 9)             (n = 12)            (n = 15)            (n = 14)
    % correct                             [illegible] (22.2)  33.3 (34.9)         41.1 (32.6)         38.9 (30.0)
    % attempted                           46.3 (17.2)         54.5 (27.7)         68.8 (24.4)         57.9 (26.5)
    % correct of attempted                47.6 (29.8)         38.3 (34.1)         48.0 (30.0)         [illegible]
  Rule application                        (n = 10)            (n = 14)            (n = 15)            (n = 14)
    % correct                             19.4 (15.5)         30.4 (22.3)         34.6 (18.6)         [illegible]
    % attempted                           96.3 (11.9)         96.2 (12.5)         98.0 (7.5)          100 (0)
    % correct of attempted                20.2 (15.6)         [illegible]         34.9 (18.1)         32.9 (23.3)
Delayed posttest
  Solvability decisions                   (n = 10)            (n = 14)            (n = 15)            (n = 14)
    % correct                             51.3 (19.9)         42.0 (19.4)         51.8 (23.4)         [illegible]
    % attempted                           95.8 (14.1)         96.1 (21.7)         96.4 (13.4)         100 (0)
    % correct of attempted                [illegible]         48.4 (23.4)         53.6 (22.2)         [illegible]
  Numerical solutions*                    (n = 9)             (n = 12)            (n = 15)            (n = 14)
    % correct                             24.0 (28.0)         25.7 (31.8)         31.4 (28.0)         34.7 (29.7)
    % attempted                           47.5 (26.9)         45.5 (30.9)         57.1 (25.8)         58.3 (27.0)
    % correct of attempted                34.7 (36.2)         39.5 (39.4)         38.0 (33.7)         44.4 (36.7)
  Rule application                        (n = 10)            (n = 14)            (n = 15)            (n = 14)
    % correct                             23.8 (19.5)         23.2 (14.2)         21.4 (17.3)         28.3 (25.8)
    % attempted                           95.0 (15.8)         92.0 (21.7)         96.4 (13.4)         100 (0)
    % correct of attempted                24.2 (19.0)         [illegible]         21.0 (16.7)         28.2 (25.8)

* One student in the table interaction/no mapping condition and two students in the table interaction/mapping condition attempted no numerical solutions and were dropped from analyses using percent correct of attempted.

Due to a server error, pretest responses were not saved for 31 
students and pretest data therefore were not used in analyses. As a 
check of random assignment, we conducted a two-way ANOVA 
where the number of units completed in the Cognitive Tutor prior 
to the start of the current study was the dependent variable and the 
locus of interaction and mapping factors were independent vari- 
ables. Results showed no significant main effects or interactions 
(Fs < 1). Since students’ classroom grades were calculated largely
based upon their progress in the tutor, these results suggest that 
initial classroom performance was equivalent across conditions. 

Learning outcomes. 

Solvability decisions. Results showed no significant main effect
of test time (F(1, 49) = 3.5, p = .07, ηp² = .07) and no main
effects of locus of interaction (F(1, 49) = 1.9, p = .18, ηp² = .04) or
mapping (F < 1). There were no significant two-way interactions
(Fs < 1). The three-way interaction among test time, locus of
interaction, and mapping was not significant (F(1, 49) = 1.7, p =
.20, ηp² = .03).

Numerical solutions. Results showed no significant main effect
of test time (F(1, 46) = 1.9, p = .17, ηp² = .04) and no main
effects of locus of interaction or mapping (Fs < 1). There were no
significant two-way interactions (Fs < 1). The three-way interaction
among test time, locus of interaction, and mapping also was
not significant (F < 1).

Rule application. Results showed no significant main effect of
test time (F(1, 49) = 2.3, p = .14, ηp² = .05) and no significant main
effects of locus of interaction or mapping (Fs < 1). There were no
significant two-way interactions: test time by locus of interaction
(F(1, 49) = 1.7, p = .20, ηp² = .03), test time by mapping (F < 1),
locus of interaction by mapping (F < 1). The three-way interaction
was not significant (F(1, 49) = 2.0, p = .17, ηp² = .04).
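
The reported effect sizes can be checked against the F ratios using the standard identity partial eta squared = (F × df_effect) / (F × df_effect + df_error). The sketch below assumes one degree of freedom for every effect, which follows from each factor having two levels; the function name and the list of tuples are illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, error df, reported partial eta squared) for effects listed above.
reported = [
    (3.5, 49, 0.07),  # solvability: test time
    (1.9, 49, 0.04),  # solvability: locus of interaction
    (1.7, 49, 0.03),  # solvability: three-way interaction
    (1.9, 46, 0.04),  # numerical solutions: test time
    (2.0, 49, 0.04),  # rule application: three-way interaction
]

recomputed = [round(partial_eta_squared(f, 1, df), 2) for f, df, _ in reported]
print(recomputed)
```

Each recomputed value matches the reported one to two decimals. The rule application test time effect (F = 2.3) recomputes to .04 rather than the reported .05, a gap that rounding of the F ratio could account for.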


Discussion 


Overall, the results from Experiment 1 showed that a simple
form of self-explanation targeted to rule-diagram mapping (that
is, clicking a numerical quantity to “name” the diagram features to
which a geometry rule applied) did not support understanding of
these rules. Further, the locus of interaction did not demonstrate a
significant impact on student learning.

¹ In this study, the relatively high absentee rate likely is due to at least
two factors. First, although students completed trade classes and the
Pennsylvania state mathematics core at the vocational school, all other
courses were completed at a traditional high school in the student’s home
school district. Thus, individual school schedules and special activities
contributed to absences at the vocational school. Second, the timing of the
study was determined by the course schedule of curriculum topics and
resulted in the delayed posttest being given the first week following winter
vacation, a likely contributor to high rates of student absence.

Unlike in previous research (Butcher & Aleven, 2007), we did 
not see a significant benefit for diagram interaction when consid- 
ering student outcomes. However, Butcher and Aleven (2007) 
used only a posttest and analyzed diagram mappings separately 
from rule naming. A post hoc analysis of the current data at 
posttest showed results that were weak but generally consistent 
with this prior work. We extracted students’ posttest diagram 
mapping scores from the rule application variable and analyzed 
these data using a two-way ANOVA where locus of interaction 
and mapping were the independent variables. Results showed 
nonsignificant trends for locus of interaction (F(1, 49) = 3.6, p = 
.06, ηp² = .07) and for the interaction between locus of interaction 
and mapping (F(1, 49) = 3.4, p = .07, ηp² = .07); there was not a 
significant main effect of mapping (F < 1). As with the rule 
application scores in Table 1, students who 
interacted with diagrams were best able to name relevant diagram 
features when they did not indicate mappings (diagram interaction 
with no mapping > diagram interaction with mapping), whereas 
the opposite was true for students who interacted with solutions 
tables (table interaction with no mapping < table interaction with 
mapping). Although these post hoc results are not strongly con- 
clusive, it is possible that “stating” the mappings distracted stu- 
dents from their principal learning goal of understanding diagram 
configurations as related to geometry rules, much like requiring 
learners to respond to example gaps in addition to self-explanation 
prompts has been found to be detrimental to learning (Hilbert, 
Renkl, Kessler, & Reiss, 2008). 

Why did this particular implementation of rule—diagram map- 
ping fail to show benefits? One possibility may be that the rule— 
diagram mappings that the tutor elicited were redundant with 
processing that occurred as students determined the numerical 
solution and geometry rule for each problem-solving step. Rule— 
diagram mappings required students to indicate relevant diagram 
features by clicking on solved quantities displayed in the diagram, 
but students already attended to these quantities while generating 
the correct numerical solution. This explanation is supported by 
the fact that during training, rule-diagram mappings were correct 
88% of the time. 

Another possibility is that the mapping implemented in this 
study was too closely tied to specific aspects of individual prob- 
lems (e.g., specific angle names) as opposed to general diagram 
configurations (e.g., two parallel lines intersected by a transversal). 
As noted elsewhere (Atkinson & Renkl, 2007), effective prompts 
must direct students’ attention to domain-level representations 
(Schworm & Renkl, 2007) rather than the specific content of 
individual problems. What would constitute domain-general map- 
pings in geometry? Since geometry postulates and theorems are 
tied to key visual configurations in problem diagrams, visually 
representing these configurations (rather than naming specific di- 
agram features) may be a better method of prompting rule— 
diagram mapping. Thus, we explored an alternative, visually based 
approach to mapping between geometry rules and diagram features 
in Experiment 2. In this study, on-demand help provided a visual 
mapping between rules and diagrams for students. 


Experiment 2 


In this experiment, we examined the impact of rule—diagram 
mapping implemented as visual cues (in the form of diagram 
highlights) within the on-demand help system of the Cognitive 
Tutor. These highlights mapped visual features of the geometry 
diagram to verbal references in the text-based hints. Unlike Ex- 
periment 1, where explanations focused on specific features (e.g., 
Angle ABC) of a problem representation, the visual highlighting in 
Experiment 2 cued the key diagram configurations (e.g., parallel 
lines) relevant to geometry rules across a variety of problem 
representations. 


Method 


Participants. Participants were 109 students from seven 10th 
grade geometry classes at the same vocational school as in Exper- 
iment 1. All classes were taught by the same teacher; both the 
teacher and the students were different from those in Experiment 
1. Students in each class were randomly assigned to one of the four 
experimental conditions described below. 

Materials. 

Geometry Cognitive Tutor. As in Experiment 1, the Geometry 
Cognitive Tutor was used to vary the locus of interaction (diagram 
vs. table). In Experiment 2 we also varied the visual appearance of 
the on-demand hints (highlighted vs. not highlighted). 

Hint format: Highlighted hints versus standard hints. The 
purpose of the highlighted hints was to provide students with a 
visual mapping between diagram features and the geometry rules 
used during problem solving. The highlighted diagram features are 
key to the rule’s applicability conditions. These rule—diagram 
mappings were implemented via step-by-step explanations pro- 
vided in the on-demand hints. Multiple hint levels are available for 
every subgoal in the Cognitive Tutor at the student’s request. The 
highlighted hints provided learners with a color-coded visual map- 
ping between the text referents present in the explanation of the 
geometry rule and the geometry diagram for the current problem; 
the color-coded highlighting was updated as the hint text changed 
when learners continued through the hints (see Figure 7). 

Standard hints provided students with the same explanations 
(i.e., identical text) in the same order as the highlighted condition. 
However, neither the standard hints nor the accompanying prob- 
lem diagrams were highlighted to show connections between ge- 
ometry rules and the specific diagrams. Standard hints appeared as 
plain text. 

Assessments. Assessments were similar to those used in 
Experiment 1, except that a delayed posttest was not possible because 
the circles units targeted by Experiment 2 were positioned at the 
end of the academic school year. There was not enough time left 
in the academic calendar for a delayed posttest to be implemented 
following the posttest. 

Small changes also were made in the assessments in response to 
the teacher’s preference that assessments be given on paper rather 
than on the computer. Whereas the computer interface used in 
Experiment 1 locked irrelevant answer areas once the student 
made a solvability decision, pilot testing showed that students 
using the paper test typically attempted to complete all blanks. 
Thus, solvability decisions in this experiment were implemented as 
a series of true/false statements (e.g., “You can use the inscribed 
angle rule to find the measure of arc KOP if you know only the 
Figure 7. Highlighting in two hints for a problem-solving step (find the measure of Arc FG). 


measure of angle KNP.”). Three diagrams were presented with six 
solvability items per diagram; students received 1 point per correct 
solvability decision (maximum = 18 points). 

Numerical solutions and rule application were tested by 15 geometry 
problems that made use of three problem diagrams. As in Experiment 1, 
students received 1 point per correct response (numerical 
solutions maximum = 15; rule application maximum = 30). 

Statistical analyses. Since there was no delayed posttest in 
Experiment 2, learning outcomes were analyzed using a series of 
two-way ANOVAs (one for each assessment category: solvability 
decisions, numerical solutions, and rule application). As in Exper- 
iment 1, a Bonferroni correction was used to correct for multiple 
comparisons. Alpha levels were set at p = (.05/3) = .017. 
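
The adjusted threshold above is simple arithmetic and can be checked directly. The following is a minimal sketch in Python; the function name `bonferroni_alpha` is hypothetical, and only the family-wise alpha of .05 and the three assessment categories come from the text.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Return the per-test alpha level under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


# One ANOVA per assessment category, as listed in the text.
categories = ["solvability decisions", "numerical solutions", "rule application"]
alpha = bonferroni_alpha(0.05, len(categories))
print(round(alpha, 3))  # 0.017
```

Under this threshold, an effect with p = .014 would count as significant, whereas one with p = .06 would not.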

Procedure. The procedure was the same as in Experiment 1 
except that the study ended with a paper posttest and did not 
include the delayed posttest. 


Results 


Of the 109 students enrolled in the seven geometry classes, 101 
students took the pretest. Of these students, four failed to attempt any 
solvability decisions, 11 failed to attempt any numerical solutions, and 
19 failed to attempt any rule application items. As a check of random 
assignment, three two-way ANOVAs (where locus of interaction and 
hint format were independent variables) were used to examine pretest 
performance on solvability decisions, numerical solutions, and rule 
application. Means and standard deviations for pretest scores are 
shown in Table 2. Results showed no significant main effects of 
experimental conditions on assessment item types: pretest solvability 
decisions (locus of interaction: F(1, 93) = 1.4, p > .24; hint format: 
F(1, 93) = 1.5, p > .25; interaction: F < 1), pretest numerical solutions 
(locus of interaction, hint format, and interaction: Fs < 1), and pretest 
rule application (locus of interaction: F(1, 78) = 2.0, p > .16; hint 
format: F(1, 78) = 2.0, p > .16; interaction: F < 1). Thus, pretest data 
are not considered further. 

Of the 101 students who took the pretest, 77 also took the 
posttest. Of these students, two students failed to attempt any 
numerical solutions (leaving 75 for analysis), and 10 failed to 
attempt any rule application items (leaving 67 for analysis). No 
student failed to attempt solvability decisions, leaving all 77 
students for analysis. Attrition was comparable across conditions (see 
Table 2). Table 2 shows the means and standard deviations for 
assessment items by experimental condition. 

Solvability decisions. Results showed no significant main 
effect of locus of interaction (F < 1) or hint format (F(1, 73) = 3.7, 
p = .06, ηp² = .05). The interaction between locus of interaction 
and hint format was not significant (F < 1). 

Numerical solutions. Results showed no significant main ef- 
fect of locus of interaction or hint format (F's < 1). The interaction 
between locus of interaction and hint format was not significant 
(test statistics illegible in the source). 

Rule application. Although results showed no significant 
main effect of locus of interaction (F(1, 63) = 3.5, p = .07, ηp² = 
.05) or hint format (F < 1), the interaction between locus of 
interaction and hint format was statistically significant (F(1, 63) = 
6.4, p = .014, ηp² = .09; see Table 2). Follow-up analyses (F 
calculated with the error term from the interaction and applying 
Bonferroni correction) showed that there was a significant difference 
between the locus of interaction conditions when viewing the 
standard hints (F(1, 63) = 9.29, p = .003), but there was not a 
significant difference between locus of interaction conditions 
when viewing the highlighted hints (F < 1). As can be seen in 
Figure 8, students who saw the standard hints were most successful 
Table 2 
Experiment 2 Means (and Standard Deviations) for Pre- and Posttest Assessment Items 

                                            Table interaction                       Diagram interaction
Item type                                   Standard hints      Highlighted hints   Standard hints      Highlighted hints
Pretest
  Solvability decisions (true/false items)  (n = 24)            (n = 25)            (n = 24)            (n = 24)
    % correct                               56.0 (57.0)         43.3 (17.3)         46.5 (17.9)         46.8 (13.2)
    % attempted                             80.8 (28.8)         79.8 (25.4)         84.5 (25.6)         86.3 (20.4)
    % correct of attempted                  68.1 (53.5)         [illegible]         52.3 (14.9)         53.9 (25.3)
  Numerical solutions                       (n = 20)            (n = 25)            (n = 24)            (n = 21)
    % correct                               16.3 (14.7)         17.3 (15.4)         15.8 (15.1)         15.9 (12.7)
    % attempted                             48.7 (32.7)         43.2 (29.0)         44.4 (31.6)         44.1 (30.8)
    % correct of attempted                  38.2 (29.4)         40.0 (28.3)         40.4 (27.3)         44.1 (34.2)
  Application of principles                 (n = 18)            (n = 23)            (n = 22)            (n = 19)
    % correct                               [illegible]         3.8 (5.0)           6.2 (6.8)           3.7 (4.0)
    % attempted                             25.1 (23.8)         28.6 (21.6)         26.7 (22.6)         33.2 (28.3)
    % correct of attempted                  21.0 (27.3)         12.3 (16.9)         28.5 (28.9)         21.0 (30.3)
Posttest
  Solvability decisions (true/false items)  (n = 22)            (n = 20)            (n = 15)            (n = 20)
    % correct                               51.0 (11.8)         50.6 (12.3)         47.8 (19.9)         51.4 (35.3)
    % attempted                             [illegible]         92.8 (15.1)         [illegible]         [illegible]
    % correct of attempted                  61.5 (18.7)         55.0 (11.6)         62.5 (21.8)         [illegible]
  Numerical solutions                       (n = 18)            (n = 22)            (n = 15)            (n = 20)
    % correct                               24.9 (16.6)         33.7 (45.3)         32.0 (61.8)         [illegible]
    % attempted                             62.4 (26.3)         [illegible]         64.4 (35.2)         73.3 (34.2)
    % correct of attempted                  45.8 (32.3)         56.7 (29.5)         49.9 (28.6)         45.0 (32.6)
  Application of principles                 (n = 21)            (n = 15)            (n = 12)            (n = 19)
    % correct                               6.0 (8.3)           11.8 (9.1)          16.7 (13.6)         9.1 (7.6)
    % attempted                             26.4 (17.5)         47.3 (31.0)         42.2 (32.7)         44.7 (30.6)
    % correct of attempted                  15.9 (20.7)         31.2 (27.8)         43.3 (28.8)         27.1 (24.3)

in applying geometry rules to diagrams at posttest when they had 
interacted directly with the diagrams during intelligent tutoring 
practice. When students were provided with rule-diagram mappings 
in the on-demand hints, interaction was not beneficial. 

Figure 8. Performance on rule application items by experimental conditions. Error bars show the standard error 
of the mean. 

Log data. We examined log data from the Cognitive Tutor for 
hints that were requested during “not given” steps (i.e., steps 
during which students must apply a geometry rule to calculate an 
answer). Separate two-way ANOVAs were conducted for percentage 
of steps on which a hint was requested and for the average 
amount of time spent on each hint. As can be seen in Table 3, there 
were no significant main effects or interactions (Fs < 1). 
According to log data, all conditions used hints in similar ways 
during intelligent tutoring practice. 

Bivariate correlations were used to explore the potential rela- 
tionship between hint use and eventual posttest performance. Since 
students who request many hints during intelligent tutoring likely 
are struggling with the content in general (Aleven & Koedinger, 
2000), it is not surprising that increased use of tutor hints overall 
was negatively correlated with performance on assessment mea- 
sures (see Table 4). However, an interesting pattern emerges when 
one considers individual conditions. For students who interacted 
with diagrams during intelligent tutoring, spending more time per 
highlighted hint was associated with better application of geometry 
rules (see Table 4); students who interacted with diagrams but saw 
standard hints did not show this pattern. For students who inter- 
acted with the solutions table and saw highlighted hints, spending 
more time per hint was associated with better performance on 
solvability judgments; students who interacted with tables but saw 
standard hints did not show this pattern. These data provide indi- 
rect evidence that visual representations of rule—diagram map- 
pings may be useful, but only if students spend more time pro- 
cessing them. Conversely, these data also may indicate that some 
instructional scaffolds can reduce active processing: Some stu- 
dents may have used the highlighted hints as a shortcut to reduce 
the effort needed to process hint content. 
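
For readers who wish to run a comparable analysis on their own tutor logs, the bivariate (Pearson) correlation between time per hint and posttest performance can be sketched as follows. This is an illustrative sketch only: the function name `pearson_r` and the variable names are hypothetical, and the numbers are invented placeholders, not data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only; NOT data from the study.
seconds_per_hint = [8.0, 12.0, 15.0, 20.0, 25.0]
pct_correct_of_attempted = [10.0, 20.0, 25.0, 40.0, 45.0]

r = pearson_r(seconds_per_hint, pct_correct_of_attempted)
print(f"r = {r:.2f}")
```

Computing the coefficient separately within each condition, as in Table 4, amounts to calling the function once per subgroup.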


Discussion 


Results from Experiment 2 demonstrate that interacting with 
diagrams during intelligent tutoring may support spontaneous 
rule—diagram mappings that facilitate understanding of the geom- 
etry rules used during problem solving. Students who saw standard 
hints during intelligent tutoring were more successful in applying 
geometry rules to specific problems at posttest if they had inter- 
acted with the diagram than if they had used a solutions table. For 
students who saw highlighted hints, the locus of interaction did not 
affect their ability to apply geometry rules to problem-solving 
diagrams. Correlational results suggest that students who take the 
time to process visually represented rule—diagram mappings may 
develop better understanding of geometry rules, especially when 
they are interacting with diagram features during problem solving. 

Why didn’t providing students with visual representations of rule— 
diagram mappings support learning of geometry rules when students 
interacted with diagrams? One possibility is that students might have 
processed the highlights in shallow ways, for example, by attending to 
numerical quantities associated with the highlights rather than the 
visual features of the geometry diagrams themselves. Another possibility 
is that providing rule-diagram mappings may reduce generative 
processes during problem-solving practice. When learners interact 
with diagrams during intelligent tutoring, they may be spontaneously 
making connections between text-based hints and visual problem 
features; providing highlighted hints that show these connections may 
negate such generative processing and reduce “desirable difficulty” 
(Bjork, 1994, 1999). This may have occurred even though a poststudy 
survey showed that all students reported similar approaches to the 
problem-solving task and similar (positive) reactions to the support 
that it provided (see the Appendix for a description of the survey and 
its analysis; survey items are provided in Table A1). Thus, students 
may need support in order to process rule—diagram mappings more 
deeply, potentially by engaging in constructive or interactive activities 
(Chi, 2009). Since increased time per hint provides circumstantial 
evidence of active learning processes (Shih, Koedinger, & Scheines, 
2008), the correlational results support this possibility: Students who 
interacted with diagrams and engaged in active processing of the 
rule—diagram mappings (i.e., spent more time with the highlighted 
hints) given to them were better able to apply geometry rules to 
problem diagrams. Thus, a key question is whether scaffolding active 
processing of the rule-diagram mappings results in better understand- 
ing of geometry rules. Experiment 3 was designed to test this question 
by systematically varying whether visually based rule—diagram map- 
pings were required and, if they were, by varying whether they were 
provided to students or generated by students. 


Experiment 3 


In this experiment, we examined the impact of providing stu- 
dents with rule—diagram mappings or requiring students to gener- 
ate rule-diagram mappings using geometry problem-solving dia- 
grams. In order to control for student attention to visual diagrams 
during problem solving, the locus of interaction was the geometry 
diagram for all conditions in this experiment (i.e., all students used 
the interactive diagram version of the Cognitive Tutor). The in- 
stantiation of rule—diagram mapping using highlighted diagram 
features was kept the same as in the previous experiment. In order 
to increase the frequency with which students encountered/gener- 
ated these rule—diagram mappings, we modified the system so that 
the mappings occurred after each error in the Cognitive Tutor. 
Errors tend to be more frequent than hint use, although both tend 
to occur when students are working on steps for which they lack 
adequate knowledge. 


Method 


Participants. Participants were 83 students from five 10th 
grade geometry classrooms at the same vocational school as in 
Experiments 1 and 2 but in a different academic year (i.e., a unique 

Table 3 
Experiment 2: Log Data (Means and Standard Deviations) Associated With Hint Use During Intelligent Tutoring 

Cognitive Tutor log measure                  (n = 17)       (n = 19)       (n = 20)       (n = 11)
Average time (in seconds) per hint           14.0 (11.0)    16.1 (13.4)    12.5 (6.9)     [illegible]
% of steps for which a hint was requested    [illegible]    31.3 (9.4)     [illegible]    30.5 (9.8)

Table 4 
Correlations Between Hint Use During Practice and Posttest Item Performance (% Correct of Attempted) 

Overall (n = 44) 
% steps for which hint was used 
Average time (seconds) per hint 
% steps for which hint was used 
Average time (seconds) per hint 
Standard hints, diagram interaction (n = 10) 

group of students). All five classes were taught by the same 
teacher. Grade-matched triplets of students were identified within 
each class, using students’ first semester geometry grades. From 
every grade-matched triplet, students were randomly assigned to 
one of three experimental conditions, described below. 
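
The matched assignment procedure described above can be sketched in a few lines of Python. This is a hedged illustration, not the procedure actually used in the study: the function name `assign_matched_triplets`, the student identifiers, and the grades are hypothetical, and the handling of a class whose size is not a multiple of three is an assumption.

```python
import random

CONDITIONS = (
    "student-generated mapping",
    "tutor-provided mapping",
    "no mapping",
)


def assign_matched_triplets(grades, rng):
    """Assign students in one class to conditions within grade-matched triplets.

    grades: dict mapping a student id to a first-semester geometry grade.
    rng: a random.Random instance (passed in so results are reproducible).
    """
    # Rank students by grade so that adjacent students have similar grades.
    ranked = sorted(grades, key=lambda student: grades[student], reverse=True)
    assignment = {}
    for start in range(0, len(ranked), 3):
        triplet = ranked[start:start + 3]
        conditions = list(CONDITIONS)
        rng.shuffle(conditions)
        # Assumption: a leftover group of one or two students simply
        # receives the first conditions of a shuffled list.
        for student, condition in zip(triplet, conditions):
            assignment[student] = condition
    return assignment


# Hypothetical class roster for illustration only.
example_grades = {"s1": 92, "s2": 90, "s3": 88, "s4": 75, "s5": 74, "s6": 70}
example_assignment = assign_matched_triplets(example_grades, random.Random(0))
```

Because each triplet contributes exactly one student to each condition, prior achievement is balanced across conditions by design rather than by chance.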

Materials. 

Geometry Cognitive Tutor. The Geometry Cognitive Tutor 
interfaces varied only in whether or not rule—diagram mapping 
was enforced following students’ errors and whether students 
generated or were provided with these mappings. The three con- 
ditions were as follows. 

Student-generated mapping. The purpose of the student map- 
ping condition was to investigate the impact of requiring students 
to generate visually based rule—diagram mappings. To generate 
these mappings, students selected the diagram features relevant to 
the specific geometry rules that were used to solve problem sub- 
goals in the Cognitive Tutor. Students selected diagram features by 
clicking on the relevant diagram elements in the tutor’s interactive 
diagram; clicking an element highlighted it in the diagram (see 
Figure 9). If a student entered an incorrect answer or reason during 
practice, she or he was locked out of the numerical solution field 
until she or he identified a correct geometry rule to justify the 
problem-solving step. If the rule also was entered incorrectly, 
students were required to revise their entry until a correct rule had 
been identified. Once a correct geometry rule was entered, students 
were required to highlight the diagram elements relevant to that 
rule (see Figure 9), forming an integrated rule—diagram represen- 
tation. The tutor scaffolded rule—diagram mappings by prompting 
students to highlight each diagrammatic feature that was necessary 
to apply a named geometry principle. 

For example, in Figure 9, after making an error in calculating 
the measure for Angle ABC, the student has selected the “Interior 
Angles Same Side” principle. The tutor has generated answer 
fields for all the diagrammatic features necessary to apply that 
principle: parallel lines, a transversal (that cuts the parallel lines), 
and two angles that are created by the intersection of the transversal 
with the parallel lines. The student has selected the parallel 
lines and the transversal in Figure 9a and must now select (i.e., 
click on) each relevant angle to complete the highlighting seen in 
Figure 9b. Students received immediate feedback on each high- 
lighted feature. Incorrect highlights turned red in the diagram and 
the accompanying answer area. Students were required to revise 
incorrect highlights until a correct, highlighted representation was 
generated. Correct highlights remained visible until the problem- 
solving step was completed. 
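
The feedback loop for student-generated mappings can be summarized as a small state-tracking sketch. This is not the Cognitive Tutor's implementation; the class name `MappingStep` and the feature labels are hypothetical and serve only to make the described behavior concrete (correct highlights persist, incorrect ones are flagged, and the step is complete once every required feature is highlighted).

```python
class MappingStep:
    """Tracks which required diagram features a student has highlighted."""

    def __init__(self, required_features):
        self.required = set(required_features)
        self.highlighted = set()

    def select(self, feature):
        """Return True if the feature is relevant to the rule (highlight kept),
        or False if it is not (the tutor would flag it in red for revision)."""
        if feature in self.required:
            self.highlighted.add(feature)
            return True
        return False

    @property
    def complete(self):
        """True once every required feature has been correctly highlighted."""
        return self.highlighted == self.required


# Hypothetical labels for the features needed by the
# "Interior Angles Same Side" principle described in the text.
step = MappingStep(
    ["parallel line 1", "parallel line 2", "transversal", "angle 1", "angle 2"]
)
step.select("parallel line 1")
step.select("parallel line 2")
step.select("transversal")   # state corresponding to Figure 9a
step.select("angle 1")
step.select("angle 2")       # state corresponding to Figure 9b
```

In the tutor-provided condition, the same final state would be displayed automatically rather than built up through student selections.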

Tutor-provided mapping. This condition utilized the same vi- 
sually based rule—diagram mappings as the student highlighting 
condition, but in this case the mappings were provided by the tutor. 
Following an error, students were required to identify the correct 
geometry rule before completing any other step. As in the student- 
generated mapping condition, students were required to revise 
their entries for the geometry rule until a correct rule had been 
identified. Once a student had identified the geometry rule that 
justified the problem-solving step, the tutor automatically high- 
lighted the diagram features necessary to apply that rule. In order 
to remain consistent with the information in the student-generated 
mapping condition, the tutor provided a textual list (i.e., correctly 
completed answer fields) of the highlighted diagram features in the 
adjacent work area. Tutor-provided highlighting and the displayed 
answer fields were identical to the final student-generated high- 
lighting in a problem-solving step (see Figure 9b). 

No mapping (control). The control condition was the diagram 
interaction condition from Experiments 1 and 2. This condition did 
not involve any highlighting of visual diagram features by either 
students or the ITS. Students completed numerical solutions and 
selected geometry rules for each problem-solving step. 

Assessments. Assessments followed the same format as in 
Experiment 1. 

Statistical analyses. A series of three repeated-measures 
ANOVAs were conducted where the independent variable was 
mapping condition (student-generated mapping, tutor-provided 
mapping, or no mapping) and the repeated factor was test time 
(immediate posttest, delayed posttest). As in Experiments 1 and 2, 
a separate analysis was conducted for each of the assessment types 
(solvability decisions, numerical solutions, and rule application 
items), and alpha levels were set at .017 using a Bonferroni 
correction (.05/3 = .017). 

Procedure. The procedure was the same as in Experiment 1 
except that students were given up to 45 min to complete the 
posttest and 30 min to complete the delayed posttest. 


Results 


As a check of random assignment, a multivariate analysis of 
variance (MANOVA) was conducted for pretest scores where 
condition was the independent variable and percent correct of 
attempted items for solvability decisions, numerical solutions, and 
rule application were the dependent variables (all were within 
acceptable limits for kurtosis and skewness: within ±2). There 
were no significant differences between conditions on pretest 
performance (numerical solutions: F(2, 79) = 1.0, p > .36; all other 
item types: Fs < 1), as can be seen in Table 5. Thus, pretest data 
were not analyzed further. 

Overall, 34 students were present for both the posttest and 
delayed posttest during the course of the study. Again, the attrition 
rate was high but comparable to that in other studies at the school 
(Salden et al., 2010; Walker et al., 2009) and comparable across 
conditions (see Table 5). Means and standard deviations are pre- 
sented in Table 5. 

Solvability decisions. There were no main effects of test time 
or condition (Fs < 1). The interaction between test time and 
condition did not reach statistical significance (F(2, 31) = 2.9, p = 
.07, ηp² = .16). 

Numerical solutions. There was no main effect of test time 
(F(1, 31) = 1.4, p = .24, ηp² = .04) or condition (F < 1) and no 
interaction between test time and condition (F(2, 31) = 1.2, p = .31, 
ηp² = .07). 

Rule application. Although there were no main effects of 
test time or condition (F's < 1), there was a significant test time 
by condition interaction (F(2, 31) = 4.7, p = .016, ηp² = .23; see 
Figure 10). Tukey-Kramer post hoc comparisons showed a 
significant difference between student-generated mapping and 
tutor-provided mapping at delayed posttest (p < .05); the no- 
mapping condition fell between the other groups at delayed 
posttest and was not significantly different from either (see 
Table 5). There were no significant group differences at imme- 
diate posttest. 
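
The effect sizes reported alongside each F statistic can be cross-checked from the F value and its degrees of freedom, because partial eta squared equals (F × df_effect) / (F × df_effect + df_error). The sketch below is a reader-side consistency check, not part of the original analysis; the function name is hypothetical, and the degrees of freedom of 2 and 31 for the interaction are an inference from three conditions and 34 students.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom."""
    if df_effect < 1 or df_error < 1:
        raise ValueError("degrees of freedom must be positive")
    numerator = f_value * df_effect
    return numerator / (numerator + df_error)


# Test time by condition interaction for rule application (F = 4.7),
# assuming df = (2, 31) as noted in the lead-in.
print(round(partial_eta_squared(4.7, 2, 31), 2))  # 0.23
```

The recovered value matches the reported effect size for the significant interaction.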


Discussion 


Results show that student-generated rule-diagram mappings 
supported better long-term understanding of geometry rules as 
evidenced by students’ abilities to apply domain principles 
(geometry rules) to specific problem representations (geometry 
diagrams) at delayed posttest. The same pattern was seen for 
students’ performance on solvability decisions, although the 
effect did not reach the level of statistical significance. These 
results highlight a trade-off between generative processing and 
immediate outcomes. Requiring students to generate their own 
rule—diagram mappings during practice may have initially 
added demands that compromised immediate performance com- 
pared to the other conditions. However, significant differences 
Table 5 
Experiment 3 Means (and Standard Deviations) for Assessment Items 

Item type                       Student-generated mapping   Tutor-provided mapping      No mapping (control)
Pretest                         (n = 28)                    (n = 29)                    (n = 25)
  Solvability decisions
    % correct                   57.8 (15.5)                 55.4 (16.9)                 58.3 (15.0)
    % attempted                 100 (0)                     99.6 (2.3)                  100 (0)
    % correct of attempted      57.8 (15.5)                 [illegible]                 58.3 (15.0)
  Numerical solutions
    % correct                   28.6 (25.4)                 21.3 (24.4)                 [illegible]
    % attempted                 61.3 (38.6)                 48.9 (30.4)                 [illegible]
    % correct of attempted      49.8 (33.8)                 36.6 (34.4)                 44.0 (35.5)
  Rule application
    % correct                   18.8 (12.7)                 20.8 (16.8)                 19.1 (12.7)
    % attempted                 99.9 (0.6)                  99.5 (23.7)                 99.9 (0.6)
    % correct of attempted*     18.6 (2.4)                  20.7 (16.6)                 18.9 (12.3)
Posttest                        (n = 10)                    (n = 11)                    (n = 13)
  Solvability decisions
    % correct                   42.5 (20.2)                 [illegible]                 59.6 (24.6)
    % attempted                 100 (0)                     96.6 (11.3)                 94.2 (20.8)
    % correct of attempted      42.5 (20.2)                 54.1 (21.8)                 62.5 (20.4)
  Numerical solutions
    % correct                   21.7 (19.3)                 28.8 (25.9)                 39.1 (29.9)
    % attempted                 36.3 (22.2)                 44.9 (19.7)                 [illegible]
    % correct of attempted      [illegible]                 48.5 (28.7)                 55.0 (34.6)
  Rule application
    % correct                   18.4 (13.7)                 22.7 (18.5)                 [illegible]
    % attempted                 99.7 (10.4)                 96.6 (11.3)                 94.0 (20.7)
    % correct of attempted      18.5 (13.8)                 23.4 (18.2)                 [illegible]
Delayed posttest                (n = 10)                    (n = 11)                    (n = 13)
  Solvability decisions
    % correct                   57.0 (23.0)                 52.3 (20.8)                 46.2 (18.7)
    % attempted                 100 (0)                     100 (0)                     93.4 (21.7)
    % correct of attempted      57.0 (23.0)                 52.3 (20.8)                 51.0 (24.7)
  Numerical solutions
    % correct                   26.3 (21.6)                 20.5 (14.0)                 [illegible]
    % attempted                 52.5 (20.2)                 56.3 (23.0)                 66.7 (27.4)
    % correct of attempted      43.2 (28.8)                 [illegible]                 36.4 (28.3)
  Rule application
    % correct                   29.4 (12.2)                 [illegible]                 19.7 (16.3)
    % attempted                 100 (0)                     100 (0)                     93.8 (21.7)
    % correct of attempted*     29.0 (12.0)                 16.9 (14.2)                 19.3 (16.0)


* Percent correct of attempted can be lower than percent correct if students attempt to solve an unsolvable item. 


emerged 1 month later, where students who generated rule— 
diagram mappings demonstrated the strongest performance on 
application items, especially compared to students who were 
provided with these mappings. 

In this study, all conditions utilized the interactive diagram 
version of the Cognitive Tutor. Thus, all students were attend- 
ing to and interacting with the diagram as they engaged in 
problem-solving practice and, as a consequence, could have 
engaged in some form of spontaneous rule—diagram mapping. 
In Experiment 3 we sought to determine if generating or pro- 
viding visual representations of rule—diagram mappings across 
problems could improve student learning more than spontane- 
ous mappings facilitated by interactive diagrams. Results show 
that students who were scaffolded in generating rule—diagram 
mappings gained benefits beyond those gained by students who 
were provided with the rule-diagram mappings, even though 
those gains were not apparent at immediate posttest. The cur- 
rent findings demonstrate that visually based interactions can be 
used to support understanding of domain principles that justify 
problem-solving steps, especially at longer retention intervals. 
It should be noted that although the pattern of performance in 
Figure 10 may seem to suggest that students improved their 
understanding of geometry rules from posttest to delayed post- 
test, this likely is an artifact of the types of items on the 
immediate versus delayed posttest (where delayed posttest 
items included less complex diagrams). If less complex dia- 
grams resulted in somewhat “easier” items at delayed posttest, 
it would be reasonable to observe increases in performance if 
students largely retained (rather than increased) their knowl- 
edge. 
Figure 10. Performance on rule application items at posttest and delayed posttest by experimental condition. 
Error bars show the standard error of the mean. 


General Discussion 


Overall, this research demonstrates that advanced learning tech- 
nologies can facilitate robust student learning via well-designed 
interactions that support students in making visually based map- 
pings between domain principles and problem representations. 
Moreover, these studies demonstrate the potential impact of new 
learning technologies to use interactive visual components to sup- 
port student learning. However, the research also demonstrates that 
interaction itself is not sufficient to increase learning: Interactions 
should be carefully designed to require students to generate con- 
nections between problem features and higher level domain prin- 
ciples. In this case, generating visual representations of the con- 
nection between geometry rules and diagram features, rather than 
having them provided, resulted in better long-term understanding 
of those rules. 

Findings from Experiment 3 demonstrate the potential to design 
interactive elements that scaffold students’ reasoning about spe- 
cific problems in relation to domain knowledge. Learning about 
geometry rules was facilitated when students were scaffolded in 
generating representations (i.e., highlighted diagrams) that connected
diagram features to geometry postulates and theorems.
However, one might question the robustness of this finding since 
there were not significant differences between the student- 
generated mapping and the no-mapping conditions in Experiment 
3. Why shouldn’t scaffolding rule-diagram mapping be signifi- 
cantly better than no mapping? Most likely, some students in the 
no-mapping condition did not generate rule-diagram mapping
while others spontaneously engaged in such rule-diagram mapping.
However, when the tutor provided the mappings, far fewer
students engaged in generative processing. More simply, it is 
likely that no generative mapping was occurring in the tutor- 
provided mapping condition, some generative mapping may have 
been happening in the no-mapping (interactive diagram) condition, 
and generative mapping was required in the student-generated 
mapping condition. The degree to which relevant, generative ac- 
tivity is required by the tutoring interface predicts the pattern of 
performance on rule application items at delayed posttest (student- 
generated mapping > no mapping > tutor-provided mapping). 
However, it also should be noted that the tutoring interface 
strongly scaffolds rule-diagram mapping in this study (by prompting
students to highlight each relevant diagram feature); thus,
scaffolded reasoning steps may be contributing to the results 
independent of visually based generation. Future research is 
needed to explore the relative contributions of these factors. 

The difficulty of designing additional interactions in a step- 
based ITS that result in further improvements to student learning 
has been documented by VanLehn (2011). VanLehn noted that the 
preponderance of evidence has demonstrated that attempts to scaf- 
fold student thinking to levels of granularity finer than step-based 
reasoning in ITS have not impacted student performance. That is, 
ITSs that require step-based and substep-based responses (where 
substep-based responses ask for reasoning behind a step) typically 
produce equivalent learning gains (VanLehn, 2011; VanLehn et 
al., 2007). In the current research, the lack of impact on problem-solving
success (specifically, performance on numerical solutions) is
consistent with VanLehn’s (2011) findings. But the current
research goes beyond previous findings to demonstrate that
generative interactions that serve to connect problems and domain 
principles can facilitate longer lasting knowledge of the reasons 
that underlie problem-solving procedures, even when these inter- 
actions support substep-based reasoning. In the current research, 
steps required students to provide numerical answers and name 
relevant geometry rules, whereas substeps required students to 
reason about how selected rules function within a specific step. At 
the substep level, students mapped the connection between a 
domain-level rule (relevant to the step) and a specific problem 
diagram using more fine-grained reasoning (e.g., identifying the 
diagram features that were necessary to apply the rule in a partic- 
ular step). Since even the control conditions in the current studies 
enforced accuracy at step levels (finding numerical answers and 
identifying geometry rules), we would not expect better rule-diagram
mapping (at the substep level) to result in greater solution
accuracy. However, we should expect rule-diagram mapping to
help students achieve a deeper understanding of how and when 
specific geometry rules are used within problem-solving steps 
(i.e., improved performance on rule application items). Experiment
3 demonstrates that visual interactions can be an effective
method to promote substep-level reasoning about domain rules, 
resulting in better long-term understanding of these rules. Find- 
ings from the current studies also demonstrate that substep 
interactions are sensitive to small changes in the focus of 
attention and in the amount of generative processing required 
by the interaction, making large effects difficult to achieve. 
Future research should continue to explore the potential for 
generative interactions to facilitate finer-grained reasoning, 
with specific exploration of visually based interactions in 
STEM areas that utilize diagrams or visual models. 

The current results are consistent with prior research showing 
that students using an ITS can develop (shallow) problem-solving 
skills that allow them to be largely successful in reaching numer- 
ical solutions to problems without fully understanding the geom- 
etry rationale for these solutions (Aleven & Koedinger, 2002). A 
similar finding has been noted for an ITS in physics (Andes): 
Although research studies spanning 5 years found that students 
using the Andes system scored higher than control students on 
conceptual measures, accuracy of numerical answers never was 
affected by tutor use (VanLehn et al., 2005). The current findings 
confirm that making accurate connections to geometry rules is not 
always critical to determining a correct numerical answer and that 
effective interactions in an ITS should target the development of 
conceptual knowledge. In this work, scaffolding mapping between 
domain principles (i.e., geometry rules) and problem features (i.e.,
diagram configurations) helped students develop problem-solving 
skills that were more closely tied to domain knowledge and, 
accordingly, represent movement away from shallow problem- 
solving. 

It should be noted that the current results do not show compel- 
ling evidence that deep learning was achieved; overall, students’ 
raw levels of performance were relatively low, and students chose 
to skip a fair number of problems during assessment. As noted in 
the Limitations section, this may be due to the specific (lower 
performing) student population that participated in this research; 
however, it also may suggest that additional interventions are
needed to move students more decisively toward a level of understanding
that could be described as “deep” rather than simply as
“less shallow.” 

Results from the current studies extend previous findings 
showing that pre-integrated materials can be less useful than 
materials that require students to actively generate an integrated 
representation (Bodemer et al., 2005); however, our results go 
beyond previous findings to show that the benefits of interac- 
tivity are moderated by the focus of student attention during 
interactivity as well as the target of the interactive actions. 
Although both Experiments 1 and 3 required students to generate
rule-diagram mappings, only Experiment 3 required students
to form a visual representation of the rule-diagram mapping
using elements in the problem diagrams. The explanations
used in Experiment 1 (where students clicked on numerical
quantities to “name” diagram elements) did not direct students’ 
attention to the key visual features involved in mapping and 
may have been redundant with problem-solving strategies. The 
interaction technique used in Experiment 3 avoids this short- 
coming by comprehensively targeting the visual features rele- 
vant to a principle’s application but is still a lightweight form of 
interaction that avoids placing undue demands on students 
during problem-solving practice (i.e., students can easily select
diagram elements by clicking). 

Results from Experiment 2 mainly were consistent with pre- 
vious research showing that visual cues can equalize the ben- 
efits of providing versus generating instructional explanations 
(de Koning et al., 2010). In Experiment 2, students who did not
receive visual mappings benefited from interacting directly 
with the diagrams (as opposed to a solutions table), likely 
because they engaged in some spontaneous processing of rule-diagram
mappings. But students who were provided with visual
representations of the rule-diagram mappings (in the form of 
highlighted hints) did not benefit from diagram interaction (i.e., 
locus of interaction did not affect rule application when students
saw highlighted hints). Providing rule-diagram mappings
for students who interact with problem diagrams may reduce
the degree to which they actively generate rule-diagram mappings
on their own, or it may facilitate or invite shallow
strategies. Although we do not have direct evidence to distin- 
guish between these possibilities, results from Experiment 3 
demonstrate that the generation of representations that depict 
rule-diagram mappings is more effective than providing such
representations. In Experiment 3, prompting students to gener- 
ate visual representations of rule-diagram mappings, rather 
than providing students with these mappings, resulted in better 
understanding of geometry rules at delayed posttest. This result 
occurred even though content and timing of the representations 
were equivalent. Overall, the current findings extend previous 
work by demonstrating that providing visual cues can be effec- 
tive when students are not already attending to relevant diagram 
features, but scaffolding student generation of domain-relevant 
representations is more effective for long-term learning. 

The implications of the current findings may extend beyond 
geometry to other domains where varied visual representations are 
used to reason about domain concepts (e.g., chemistry). Overall, 
this research suggests that scaffolding visual interactions in problem
diagrams can be an effective way to support students’ understanding
of how domain concepts apply to specific problems,
especially when such interactions build a representation that demonstrates
the domain-level concept(s) operating on the problem. In
other domains, different interactions may be necessary to achieve 
meaningful coordination between visual features and domain prin- 
ciples. For example, visual representations of chemistry molecules 
may need to be rotated or labeled during problem solving (Stieff, 
Ryu, & Dixon, 2010). More research is needed to understand 
rule-diagram mapping and its influences on learning in other
domains. 

Finally, it is important to remember that the benefits of student-generated
rule-diagram mappings were not evident until delayed
posttest, which occurred a month following the immediate posttest. 
Thus, the current work argues for the importance of assessing at 
longer delays when attempting to evaluate the potential of instruc- 
tional interventions. 


Limitations 


The current research is not without its limitations. First, 
students at the vocational school where the studies were con- 
ducted may represent a lower achieving population compared to 
students in traditional school settings. This may help account 
for the relatively low overall levels of performance (when one 
considers raw scores) and of the percentage of problems that 
students attempted to solve. Although the potential to support 
learning with lower achieving populations is not trivial, future 
research should explore the impact of rule-mapping interactions 
with typical student populations. Second, absenteeism was a 
recognized problem at this vocational school. Although the 
rates of absenteeism seen during this study were similar to those 
in other studies conducted at the school (Salden et al., 2010; 
Walker et al., 2009), a more sensitive picture of tutor impact 
may be seen in educational contexts with consistent attendance. 
Third, the tutor language in these studies could have been more 
closely aligned to the formal language of geometry. Using the 
label rule to refer to the geometry postulates and theorems that 
justified problem-solving steps reflects an informal use of lan- 
guage that may have complicated students’ reasoning. This was 
especially true in Experiment 1, where “rules” were associated 
with “applied to” fields. Interpretation of the “applied to” label 
required students (a) to recognize that this field targeted kn- 
own (or previously solved) information rather than the un- 
known quantity and (b) to (perhaps unnaturally) separate the 
known arguments (e.g., two angles) from the unknown quantity 
during application of a geometry postulate or theorem. Future 
research should more carefully align the language within the 
tutoring environment to the language of the domain (e.g., draw- 
ing upon language from geometry proofs to require a “reason” 
rather than a “rule”). Finally, the research was conducted during 
a focused set of instructional units in the geometry curriculum. 
It remains to be seen how longer term use of rule-diagram
mappings in an ITS affects student outcomes. Scaffolding of 
students’ rule-diagram mappings may need to be removed as 
students gain competence with the task, since other research has 
shown that fading instructional support (Atkinson et al., 2003; 
Salden, Aleven, Schwonke, & Renkl, 2008) can improve student
learning and knowledge transfer.


Conclusions


The current research informs the development of advanced 
learning technologies for domains that include visual and verbal 
information sources. Our work demonstrates that designing 
student interactions that promote learning requires careful at- 
tention to the representational forms with which students may 
interact. Although connecting diagram elements to domain rules 
via student-generated highlights supported long-term learning 
about these rules (Experiment 3), making these same connec- 
tions by interacting with solved quantities was ineffective (Ex- 
periment 1). Interacting directly with diagrams appeared to 
facilitate spontaneous processing of rule-diagram mappings, 
but providing visual representations of rule-diagram mappings
negated the effects of interaction (Experiment 2). Providing 
visual representations of rule-diagram mappings was not as 
effective as scaffolding student generation of these mappings 
(Experiment 3). Together, these results demonstrate that at- 
tempts to explore the assistance dilemma (Koedinger & Aleven, 
2007) may require attention not only to the kinds of information 
being provided or withheld but also to the representational 
format of the information and the interactions available to 
promote deeper processing of the representations. Overall, our 
findings show that there are potential benefits in learning tech- 
nologies that facilitate student interaction with multimedia and 
visual representations, especially when these interactions focus 
student attention and processing on key domain concepts.


Appendix


Experiment 2 Poststudy Survey Description and Analysis 


A 14-item Likert-style survey was used to gauge students’ reactions to the intelligent tutoring system using
a 6-point scale (from 1 = totally disagree to 6 = totally agree). Multivariate results showed no significant
main effect of the locus of interaction (F(13, 49) = 1.1, p > .40) or tutor format (F < 1) and no interaction
among the independent variables (F(13, 49) = 1.3, p > .27). The first item (“I am color-blind ...”) was
dropped from this analysis based upon kurtosis and skewness; all other questions were within acceptable limits
for kurtosis and skewness (between ±2).
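
The screening step described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not say which software or estimator was used, so the sketch uses simple population-moment formulas for skewness and excess kurtosis, assumes a cutoff of 2 in absolute value, and runs on made-up responses. The function names and data are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Population-moment skewness and excess kurtosis (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    if m2 == 0:
        raise ValueError("responses have no variance; statistics are undefined")
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


def within_limits(values, limit=2.0):
    """True if both statistics fall within +/- limit (assumed cutoff of 2)."""
    skew, kurt = skewness_and_excess_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit


# Hypothetical 6-point Likert responses (not the study's data).
# An item almost everyone rejects, with one extreme response:
color_blind_item = [1] * 14 + [6]
# An item with responses spread around the middle of the scale:
easy_to_use_item = [4, 5, 3, 4, 2, 5, 4, 3, 6, 4, 5, 3, 4, 2, 5]

for name, item in [("color-blind", color_blind_item),
                   ("easy to use", easy_to_use_item)]:
    status = "retain" if within_limits(item) else "drop"
    print(f"{name}: {status}")
```

An item on which nearly every respondent gives the same answer, as would be expected for a color-blindness question, produces a strongly skewed and peaked distribution, which is why such an item fails this kind of check.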


Table A1
Experiment 2 Poststudy Survey Items: Means (and Standard Deviations) for Responses by Condition

Item (1 = totally disagree, 6 = totally agree)

 1. I am color-blind or I have difficulty seeing colors.
 2. I thought that the tutor was easy to use.
 3. It was easy to lose track of what angle I was working on while using the tutor.
 4. The question marks in the images helped me figure out what I needed to solve.
 5. It was easy to figure out the order in which I needed to find the answers in each problem.
 6. I paid a lot of attention to the diagram while I was solving problems.
 7. It was easy to understand the hints that I got on the computer.
 8. The hints made the diagrams easier to understand.
 9. I used the hints more during this study than I normally do when using the Carnegie Learning tutor.
10. The geometry rules in the glossary were easy to understand.
11. It is better to ask my teacher for help than to ask for a hint on the computer.
12. I think I learned a lot by using the tutor.
13. The problems on paper (the posttest) were harder than the problems in the tutor.
14. I liked participating in this study.

        Table interaction                       Diagram interaction
Item    Standard hints    Highlighted hints     Standard hints    Highlighted hints
        (n = 15)          (n = 15)              (n = 17)          (n = [illegible])
 1      1.0 (0)           1.5 (1.4)             1.2 (0.6)         1.0 (0)
 2      4.0 (1.3)         3.9 (1.4)             4.4 (1.1)         4.1 (0.6)
 3      [illegible]       4.1 (1.5)             3.4 (1.2)         [illegible]
 4      [illegible]       4.2 (1.5)             3.8 (2.0)         4.3 (1.3)
 5      4.3 (1.3)         3.5 (1.5)             2.9 (1.7)         3.3 (1.4)
 6      4.0 (1.5)         4.2 (1.1)             4.1 (1.3)         4.2 (0.8)
 7      4.3 (2.0)         3.1 (1.3)             3.7 (1.3)         4.0 (1.2)
 8      4.5 (1.5)         [illegible]           3.8 (1.1)         4.3 (1.4)
 9      4.4 (1.5)         4.4 (1.8)             4.1 (1.5)         3.8 (1.6)
10      4.5 (1.4)         [illegible]           4.0 (1.2)         [illegible]
11      2.9 (1.8)         [illegible]           2.9 (1.9)         1.8 (1.2)
12      [illegible]       3.7 (1.4)             [illegible]       3.8 (1.0)
13      4.3 (1.9)         4.2 (1.6)             3.7 (1.9)         4.0 (2.0)
14      3.0 (1.6)         2.8 (1.9)             3.7 (1.8)         3.6 (1.2)


Writing Pal: Feasibility of an Intelligent Writing Strategy Tutor
in the High School Classroom


Rod D. Roscoe and Danielle S. McNamara 


Arizona State University 


The Writing Pal (W-Pal) is a novel intelligent tutoring system (ITS) that offers writing strategy
instruction, game-based practice, essay writing practice, and formative feedback to developing writers.
Compared to more tractable and constrained learning domains for ITS, writing is an ill-defined domain 
because the features of effective writing are difficult to quantify and individual writers can employ 
diverse strategies to achieve similar goals. The development of an ITS in an ill-defined domain presents 
particular challenges regarding comprehensive instruction, modularized content, extended practice, and 
formative feedback. In this article, we describe how the development of W-Pal has uniquely addressed 
these concerns and present the results of a study assessing the feasibility of this system in high school 
English classrooms. This study included 2 teachers and their 141 10th grade English class students who 
utilized W-Pal over a 6-month period during the academic year. Log-file analyses showed that students 
used all aspects of W-Pal, but activity and engagement were uneven throughout the year and decreased
over time. Essay scores improved over time and surveys indicated that students perceived the lessons, 
games, and feedback as beneficial. However, specific aspects of the learning environment were critiqued 
as annoying, challenging, or lacking specificity. Overall, the results suggest that the system was generally 
well-received by the students but also offer insights for the development of ITSs in ill-defined domains. 


Keywords: intelligent tutoring systems, writing instruction, usability and feasibility testing, ill-defined
learning domains


Intelligent tutoring systems (ITSs) provide adaptive, interactive, 
computer-based support for learning based on sound pedagogical 
principles (Graesser, McNamara, & VanLehn, 2005), and educa- 
tors now have access to effective intelligent tutors in domains such 
as mathematics (Beal, Arroyo, Cohen, & Woolf, 2010), geometry 
(Aleven & Koedinger, 2002), biology (Michael, Rovick, Glass, 
Zhou, & Evens, 2003), physics (Graesser et al., 2004; VanLehn et 
al., 2005), computer literacy (Graesser et al., 2004), reading com- 
prehension (McNamara, O’Reilly, Best, & Ozuru, 2006), and 
foreign language (Gamper & Knapp, 2002; Johnson & Wu, 2008). 
In this study, we examine the Writing Pal (W-Pal), an ITS that 
offers writing strategy instruction along with game-based practice, 
essay writing practice, and formative feedback to high school 
students. Historically, ITS development has focused on well-defined
learning domains, in which fundamental concepts, procedures,
and evaluation criteria are relatively constrained. In contrast,
writing is an ill-defined learning domain because the features
of skilled writing are difficult to quantify, and individual writers
may employ diverse strategies to achieve similar goals.

A particular focus of this study is how high school students 
perceive intelligent tutoring of writing in the classroom (Grimes & 
Warschauer, 2010). For ill-defined domains, in which evaluations 
of students’ work are inherently debatable, such subjective reac- 
tions are crucial. Students who rebuff the ITS are unlikely to 
engage with the system over meaningful periods of instruction 
(i.e., several weeks, a semester, or a school year). Thus, we assume 
that feasibility depends upon whether the system is perceived as 
valid and valuable. At this stage in W-Pal’s development, an 
experimental test of instructional efficacy was not warranted. 
Rather, it was most important for us to examine (a) how and
whether students use W-Pal over time and (b) students’ perceptions
of the utility and design of W-Pal. These data help to
define the feasibility of the system and inform later development 
and deployment. 


Computer Support for Writing Instruction 


Several technologies have been developed to support students’ 
writing by grading essays (Grimes & Warschauer, 2010; Shermis 
& Burstein, 2003), teaching summarization (Kintsch, Caccamise, 
Franzke, Johnson, & Dooley, 2007) and argumentation skills 
(Wolfe, Britt, Petrovich, Albrecht, & Kopp, 2009), or scaffolding 
essay composition (Proske, Narciss, & McNamara, 2012; Rowley 
& Meyer, 2003). An important question is how well technologies 
address the pedagogical needs arising from the ill-defined nature 
of writing. Ill-structured problems possess ambiguous goals, solu- 
tion paths, or assessment criteria (Simon, 1973). Lynch, Ashley, 
Pinkwart, and Aleven (2009, p. 258) argued that learning domains
are ill-defined when “essential concepts, relationships, and procedures
for the domain” and the “means to validate problem solutions
or cases” are not specified by a single strong domain theory.
There may be multiple conceptualizations of key problems and 
tasks and there may be multiple approaches for solving those 
problems. Given such ambiguity, assessment of solutions may also 
be context-dependent and subjective. Thus, ITSs in ill-defined 
domains must not only address the challenges common to any 
educational technology, but must also overcome unique hurdles 
that arise when appropriate content, tasks, evaluation, and feed- 
back are uncertain. 

The ill-defined nature of writing emerges from the many non- 
linear and interactive tasks that comprise the writing process 
(Deane et al., 2008; Flower & Hayes, 1981). For example, pre- 
writing involves generating and organizing ideas prior to writing, 
and drafting involves translating initial ideas and plans into co- 
herent text. In persuasive writing, writers must frame their argu- 
ments precisely and objectively and support arguments with fac- 
tual evidence. Subsequently, revising entails elaborating and 
reorganizing the text to improve overall quality. Throughout these 
stages, writers also develop cohesion, style, voice, and other global 
qualities. To help students navigate these complex demands, writ- 
ing pedagogy emphasizes the importance of strategy instruction 
that equips students with (a) concrete strategies for diverse writing 
processes, (b) background knowledge for using the strategies, and 
(c) opportunities for extended practice (Graham, McKeown, Kiu- 
hara, & Harris, 2012; Graham & Perin, 2007). Effective interven- 
tions teach explicit strategies for planning, drafting, editing, and 
summarizing, along with information about how and why the 
strategies should be used (De la Paz & Graham, 2002). 

Another aspect of the ill-defined nature of writing is the sub- 
jectivity of evaluation. Every essay exhibits unique content and 
errors that represent individual students’ writing processes. To 
assign a score, essay graders (e.g., teachers) must interpret the 
appropriateness of these decisions in the context of the assignment. 
Writing assessment research has found this process to be challeng- 
ing (Huot, 1996; Meadows & Billington, 2005). Over time and 
multiple instances of grading, human graders are unlikely to assign 
the same grades to the same essays consistently unless carefully 
trained to do so (Crossley & McNamara, 2011; Meadows & 
Billington, 2005). Such subjectivity also raises questions about 
how to give meaningful feedback. Research has emphasized the 
importance of individualized, formative feedback that describes clear
methods for improvement (McGarrell & Verbeem, 2007; Shute, 
2008), such as strategies for developing arguments and evidence. 
In contrast to summative feedback on overall performance, forma- 
tive feedback supports writing proficiency by making the means of 
progress explicit. 

An analysis of writing instruction from the perspective of ill- 
defined learning domains thus suggests several design principles that 
are germane to any writing ITS. An intelligent writing tutor may need 
to combine (a) comprehensive strategy instruction across multiple 
phases of writing, (b) modularized content to accommodate different 
pedagogies or student needs, (c) opportunities for extended and varied 
writing practice, and (d) formative writing feedback related to writing 
proficiency and strategies. In the following sections, we consider how 
prior technologies have addressed these issues, and then discuss how 
these design principles have been uniquely implemented within the 
W-Pal tutoring system. 


Automated Essay Scoring and Writing Evaluation 


A significant challenge for computer-based writing instruction 
is the automated assessment of student writing and delivery of 
meaningful feedback. One advantage is that computer-based tools 
can evaluate many text features consistently and simultaneously, 
and apply the same criteria to all essays reliably and objectively. 
Indeed, automated essay scoring (AES) systems have been devel- 
oped to facilitate essay grading using statistical modeling, machine 
learning, natural language processing (NLP), and latent semantic 
analysis (LSA). Prominent systems include e-rater (Attali & Bur- 
stein, 2006), IntelliMetric (Rudner, Garcia, & Welch, 2006), and 
Intelligent Essay Assessor (IEA; Landauer, Laham, & Foltz, 
2003). Overall, AES scoring tends to be accurate. Human and 
computer-assigned scores correlate around .80 to .85 (Warschauer 
& Ware, 2006), with 40–60% perfect agreement (exact match of
human and computer scores) and 90–100% adjacent agreement
(human and computer scores within 1 point; e.g., Attali & Burstein,
2006; Dikli, 2006; Rudner et al., 2006). Over time, AES
systems have become embedded within automated writing evalu- 
ation (AWE) systems that assign scores along with feedback on 
errors (e.g., spelling) and may include instructional scaffolds and 
learning management tools (Grimes & Warschauer, 2010). Exam- 
ples include Criterion (e-rater scoring engine) from the Educa- 
tional Testing Service (Burstein, Chodorow, & Leacock, 2004), 
MyAccess (IntelliMetric engine) from Vantage Learning (Grimes 
& Warschauer, 2010), and WriteToLearn (IEA engine) from Pear- 
son Education (Landauer, Lochbaum, & Dooley, 2009). 
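
As an illustration of the agreement statistics cited above, the following minimal sketch computes exact agreement, adjacent agreement, and the Pearson correlation for paired human and computer scores. The function name agreement_stats and the scores are hypothetical and were chosen only for illustration; they are not taken from any of the cited AES systems or studies.

```python
from math import sqrt


def agreement_stats(human, machine):
    """Return (exact agreement, adjacent agreement, Pearson r) for paired scores."""
    if len(human) != len(machine) or len(human) < 2:
        raise ValueError("need two equally long score lists with at least 2 essays")
    n = len(human)
    pairs = list(zip(human, machine))

    # Exact (perfect) agreement: proportion of essays with identical scores.
    exact = sum(h == m for h, m in pairs) / n
    # Adjacent agreement: proportion of essays with scores within 1 point
    # (this count also includes the exact matches).
    adjacent = sum(abs(h - m) <= 1 for h, m in pairs) / n

    # Pearson correlation between the two sets of scores.
    mean_h = sum(human) / n
    mean_m = sum(machine) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in pairs)
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in machine)
    if var_h == 0 or var_m == 0:
        raise ValueError("correlation is undefined when one set of scores is constant")
    r = cov / sqrt(var_h * var_m)
    return exact, adjacent, r


# Hypothetical scores on a 1-6 holistic scale (illustrative, not real data).
human_scores = [4, 3, 5, 2, 4, 6, 3, 4, 5, 2]
machine_scores = [4, 4, 5, 3, 4, 5, 3, 3, 5, 2]

exact, adjacent, r = agreement_stats(human_scores, machine_scores)
print(f"exact = {exact:.2f}, adjacent = {adjacent:.2f}, r = {r:.2f}")
```

Because adjacent agreement includes exact matches, it can never be lower than exact agreement, which is consistent with the wide gap between the two ranges reported above.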

Evaluations of AWE technologies have focused primarily on scor- 
ing accuracy, although a few studies have examined instructional 
efficacy. For example, Shermis, Burstein, and Bliss (2004) examined 
essay scores for over 1,000 high school students, half of whom
participated in typical classroom instruction and half of whom used 
Criterion. The two groups did not differ in holistic essay quality, 
although Criterion users produced longer essays with fewer mechan- 
ical errors. Rock (2007) obtained comparable results in a study with 
over 1,400 ninth grade students using Criterion. Finally, Kellogg, 
Whiteford, and Quinlan (2010) experimentally manipulated how 
much feedback 59 undergraduates received from Criterion on three 
essays. Students received feedback on all essays, one essay, or none. 
Holistic essay quality did not differ across conditions, although stu- 
dents who received more feedback displayed fewer mechanical errors 
in their essay revisions. In sum, Criterion has been successful in
improving student essays but primarily for mechanical properties, 
rather than holistic quality. 

Grimes and Warschauer (e.g., Grimes & Warschauer, 2010; 
Warschauer & Grimes, 2008) have argued for the need to examine 
users’ perceptions of AWE tools in the classroom. Successful 
deployment of writing technologies may depend upon whether 
teachers and students view the tools as valid, useful, and usable. 
Within this framework, Warschauer and Grimes (2008) examined 
perceptions of Criterion or MyAccess in four schools, obtaining 
survey and interview data from principals, teachers, and students 
(sixth to 12th grade). Both systems were perceived to increase 
students’ motivation to write and improve writing quality, but the 
tools were used infrequently due to curricular conflicts. Students
did not always have time for extra writing assignments and the
systems could not support every writing genre that teachers wished 
to cover. In addition, although the systems seemed to promote 
essay revising, most revisions focused on mechanics rather than 
content, organization, or style. 

Grimes and Warschauer (2010) later examined MyAccess over 
a 3-year period in four middle schools. System use was initially 
infrequent—teachers did not create assignments in the system and 
students rarely revised. However, use increased over time as 
teachers became more comfortable with the technology. Survey 
data revealed both positive attitudes and skepticism. Teachers felt 
that MyAccess saved time, made teaching easier and more enjoy- 
able, and allowed them to focus on higher level concepts. Teachers 
also reported that students were more motivated to write. How- 
ever, teachers doubted the accuracy of the automated scores. They 
also favored MyAccess for persuasive essay writing but preferred 
traditional methods for informative, narrative, or analytical genre 
writing. Teachers also felt that MyAccess was suited to teaching 
sentence fluency and conventions, but less helpful for covering 
ideas, organization, voice, and word choice. Similarly, students 
perceived the system as usable and enjoyable, and felt that it 
increased their confidence and quantity of writing. However, stu- 
dents had trouble understanding the feedback and felt over- 
whelmed by the quantity of feedback. Some teachers had to create 
handouts to help students navigate the “pages of suggestions” from 
the system. In addition, some students began to focus on improv- 
ing their scores rather than communicating their ideas. 

In sum, research on AWE tools is promising but highlights how 
efficacy may be hindered by student and teacher perceptions. 
When users doubt the automated scores or feedback, or find them 
overwhelming, it is unlikely that the system will achieve its true 
potential. Another concern may be an emphasis on practice and 
feedback with less attention paid to strategy instruction or modular 
design. The fundamental purpose of AWE systems is the facilita- 
tion of writing assessment rather than teaching students about 
writing principles, goals, and strategies. Without such instruction, 
students may not be prepared to utilize the detailed writing feed- 
back these tools offer. Last, an emphasis on error feedback may 
not satisfy the principle of formative feedback. 


Computer-Based Tutorials for Writing 


A few technologies have been created to teach specific writing 
skills or to scaffold the writing process. For example, the LSA- 
based Summary Street (Caccamise, Franzke, Eckhoff, Kintsch, & 
Kintsch, 2007; Kintsch et al., 2007) supports students’ summari- 
zation skills. When students write summaries in the system, they 
receive graphical feedback showing how well their text captures 
the source materials. Research with Summary Street has shown 
that students wrote more effective summaries and spent more time 
engaged in writing when using the system. Perceptions of the 
system were also positive: students found the system easy to use 
and appreciated receiving feedback related to what they needed to 
fix in their summaries. Similarly, Wolfe et al. (2009) developed a 
web-based tutor for developing arguments, counterarguments, and 
rebuttals. Evaluations of this system have shown that the tutorial 
instruction improved students’ ability to perform these tasks. Overall, 
such research suggests that computer-based tutorials can be 
effective for training students on specific strategies related to 
writing. 

Another technology, Computer Tutor for Writing (CTW; Rowley 
& Meyer, 2003), adopted a scaffolding approach in which 
students wrote essays in an enhanced word processor. The inter- 
face provided “workspaces” in which students could view descrip- 
tions, examples, and hints related to the writing process, such as 
goal-setting, drafting, and publishing. A tracking system moni- 
tored completion of these tasks. Importantly, CTW did not provide 
a holistic score for essays, nor were students given error feedback 
or strategy guidance for improving their essays. Thus, writing 
support in CTW was instantiated solely as structured guidance 
during composition. An evaluation of the CTW with 471 middle 
and high school students (Rowley & Meyer, 2003) revealed no 
difference between control (i.e., no CTW training, n = 174) and 
experimental conditions (i.e., training with CTW, n = 298). Neither 
group improved from pretest to posttest with regard to essay 
scores; control participants’ scores decreased by about 1%, 
whereas experimental participants’ scores increased by about 2%. 

Proske et al. (2012) adopted a similar scaffolding approach with 
the escribo system. In escribo, students receive online support for 
prewriting, drafting, and revising processes, along with feedback 
about their choices at each stage. Forty-two German university 
students practiced writing with or without the system in one 
training session and then wrote an unsupported essay in a posttest 
session. Overall, students who interacted with escribo spent more 
time planning their essays, which facilitated faster drafting of the 
text. escribo students also spent more time revising their essays 
and the resulting texts were rated as more comprehensible. Thus, 
when students are provided with both comprehensive strategy help 
and informative feedback on their writing process, computer-based 
tutorials for writing are more effective. 

In sum, previous computer-based writing tutors have shown 
mixed results, which may be attributed to whether feedback was 
provided. Successful tutors for summarization and argumentation 
focused on fewer skills but offered feedback on students’ perfor- 
mance. The main drawback is potentially their scope; they do not 
provide comprehensive or modular instruction related to the entire 
writing process. In contrast, CTW addressed all phases of writing 
with support for each task, but students did not receive strategy 
feedback. The system appeared to be of little benefit. However, 
when structured writing support is combined with feedback, as in 
escribo, empirical evidence suggests that a scaffolding approach 
can be effective. 


The Writing Pal 


In the development of W-Pal, we have sought to synthesize key 
principles of strategy instruction, modularity, extended practice, 
and formative feedback (McNamara et al., 2011). The interdisci- 
plinary development of the initial version of W-Pal spanned over 
3 years with input from cognitive psychology, linguistics, com- 
puter science, and English education. 


Writing Strategy Modules 


The principles of comprehensive strategy instruction and modularized 
content were instantiated in W-Pal via nine Writing Strategy 
Modules (see Table 1). The content for these modules was 


Table 1 

Module | Description of Strategies | Practice Games 

Prologue | Introduces W-Pal and the animated characters, and discusses the importance of writing | (none) 

Prewriting Phase 
Freewriting | Covers freewriting strategies for quickly generating essay ideas, arguments, and evidence prior to writing (FAST PACE mnemonic) | Freewrite Feud; Freewrite Fill-In 
Planning | Covers outlining and graphic organizer strategies for organizing arguments and evidence in an essay | Mastermind Outline; Planning Pump 

Drafting Phase 
Introduction Building | Covers strategies for writing introduction paragraph thesis statements, argument previews, and attention-grabbing techniques (TAG mnemonic) | Essay Launcher; Dungeon Escape; Fix It — Introductions 
Body Building | Covers strategies for writing topic sentences and providing objective supporting evidence (KISS & Tell mnemonic) | RAM-5; Fix It — Bodies 
Conclusion Building | Covers strategies for restating the thesis, summarizing arguments, closing an essay, and maintaining reader interest in conclusion paragraphs (RECAP mnemonic) | Fix It — Conclusions; Dungeon Escape 

Revising Phase 
Paraphrasing | Covers strategies for expressing ideas with more precise and varied wording, varied sentence structure, and condensing choppy sentences | Adventurer’s Loot; Map Conquest 
Cohesion Building | Covers strategies for adding cohesive cues to text, such as connective phrases, clarifying undefined referents, and threading ideas throughout the text | CON-Artist; Undefined & Mined 
Revising | Covers strategies for reviewing an essay for completeness and clarity (TETRIS mnemonic), and strategies for how to improve an essay by adding, removing, moving, or substituting ideas (ARMS mnemonic) | Speech Writer 


developed based on research on writing strategy instruction (e.g., 
Graham & Perin, 2007) and substantive, iterative input from expert 
writing educators (Roscoe, Varner, Weston, Crossley, & McNa- 
mara, in press). Writing strategies were discussed by three ani- 
mated agents via lesson videos (15-30 min each). Dr. Julie 
(teacher agent) explained the strategies, and Mike and Sheila 
(student agents) demonstrated them (Figure 1). These characters 
were developed using Media Semantics Character Builder soft- 
ware and text-to-speech voices by Loquendo. For many lessons, 
multiple strategies were organized by acronymic mnemonic de- 
vices, which can facilitate adolescent students’ recall and use of 
writing strategies (e.g., De la Paz & Graham, 2002). Quiz and 
game-like checkpoints were embedded in the lessons to reinforce 


Figure 1. Screenshot of Writing Pal virtual classroom (Paraphrasing 
lesson). 


the content, and students could take notes. All modules were 
accessible from a “Lessons Tab” in the W-Pal interface, which 
allowed users to progress through the modules in a flexible order. 


Game-Based Practice 


The principle of opportunities for extended practice was realized 
by developing two broad modes of practice: game-based 
practice and essay writing practice. In W-Pal, a suite of educa- 
tional games allows students to practice specific strategies outside 
of the context of complete essays. For example, students can 
practice strategies for evaluating evidence or building cohesion 
before applying these strategies in their own persuasive essays. 
Game-based practice was also chosen to address problems of 
student engagement. One challenge for ITSs is that students be- 
come bored and frustrated with extended practice (Bell & McNa- 
mara, 2007; Jackson & McNamara, 2013). Games offer a means of 
improving students’ motivation to participate by leveraging their 
intrinsic enjoyment of gaming (Shank & Neeman, 2001). 

In W-Pal, each Writing Strategy Module was associated with 
one or more practice games that students “unlock” by completing 
the lessons (see Table 1). This version offered 15 unique games. 
These games were iteratively developed by selecting key strategies 
covered in the lessons and then constructing generative or identi- 
fication practice tasks. In generative practice, students write short 
texts (e.g., a conclusion paragraph) while applying one or more 
strategies. In identification practice, students examine text excerpts 
to label the strategies used, or to identify how strategies may be 
used to improve the text. These practice tasks were then embedded 
in diverse game mechanics and narratives. Feedback in the practice 
games was contextualized via the game design, such as winning or 
losing, earning points, the amount of fuel consumed by a space- 
ship, or the quality of treasure obtained. Thus, students could judge 
whether their strategy application was effective based on their 
game progress. In some cases, formative feedback was also offered, 
such as tips for succeeding in the game by using certain 
strategies or mnemonics. 

To provide examples, we briefly present two games: Freewrite 
Feud and Essay Launcher. In Freewrite Feud (see Figure 2), 
students were given several minutes to freewrite on a persuasive 
writing prompt. For each prompt, a hidden list of key words and 
concepts was constructed based on a previous corpus of freewrites. 
Students earned points by typing their ideas quickly and continu- 
ously, and earned additional points when their freewrites incorpo- 
rated up to six of the key words. Because these key words were 
hidden from the player, this generative game encouraged students 
to practice brainstorming many ideas, arguments, and potential 
pieces of evidence because doing so would trigger the key words 
and earn a higher score. 
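
As a rough illustration of the scoring idea described above, the following Python sketch awards base points for the amount of text produced and bonus points for hidden key words, capped at six. The point values, the use of word count as a stand-in for typing "quickly and continuously," and the name score_freewrite are assumptions made for illustration; they are not taken from the W-Pal implementation.

```python
import re


def score_freewrite(text, key_words, max_key_words=6,
                    points_per_word=1, key_word_bonus=50):
    """Return (total, base, bonus) for one freewrite.

    Hypothetical scoring: one point per word typed stands in for
    typing quickly and continuously; each distinct hidden key word
    that appears earns a bonus, capped at max_key_words.
    """
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    base = points_per_word * len(tokens)

    # Count distinct key words that occur anywhere in the freewrite.
    vocabulary = set(tokens)
    distinct_keys = {kw.lower() for kw in key_words}
    hits = sum(1 for kw in distinct_keys if kw in vocabulary)

    bonus = key_word_bonus * min(hits, max_key_words)
    return base + bonus, base, bonus
```

Because the key words stay hidden, a player in this sketch can raise the bonus only by producing a wider range of ideas, which is the behavior the game is meant to encourage.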

Essay Launcher was an Introduction Building game (see Figure 
3). In this identification game, students attempted to repair and 
rescue several spaceships. To “repair the ship,” students chose a 
thesis statement for an example introduction paragraph from a list 
of three options. To “set the course,” students turned a dial labeled 
with attention-grabbing techniques to identify the technique used 
in the paragraph. Once both selections were made, students con- 
sumed one fuel unit to launch the ship. If either choice was 
incorrect, the launch malfunctioned. Students then received feed- 
back about introduction strategies and could try again. Points were 
based on rescued ships and remaining fuel. This game allowed 
students to practice evaluating key characteristics of essay intro- 
ductions. 


Essay-Based Practice and Feedback 


The principles of formative feedback and opportunities for 
extended practice were supported by the W-Pal Essay Writing 
Interface (see Figure 4). W-Pal allowed students to practice 
writing timed persuasive essays using SAT-style prompts in 
which they could synthesize and apply strategies covered in any 
module. Students could select the prompt, set the time limit, and 
use a scratchpad for prewriting. Essays were written using a 
simple word processor and then submitted for automated as- 
sessment. 














Figure 2. Freewrite Feud practice game (Freewriting). 

Figure 3. Essay Launcher practice game (Introduction Building). 


W-Pal scoring is powered by NLP algorithms utilizing Coh- 
Metrix and other text analysis tools (Crossley & McNamara, 
2011; Graesser & McNamara, 2012; McNamara, Crossley, & 
McCarthy, 2010; McNamara, Crossley, & Roscoe, 2012), and 
such algorithms are a key source of the intelligence of a writing 
ITS. Within ITSs that accept natural language as input (e.g., 
essays or verbal explanations of scientific processes), students’ 
responses are open-ended and potentially ambiguous. When a 
user enters natural language into a system and expects useful 
and intelligent responses, NLP is necessary to interpret that 
input (McNamara, Crossley, & Roscoe, in press). In service to 
these goals, W-Pal utilizes Coh-Metrix to analyze text on sev- 
eral dimensions of cohesion including co-referential cohesion, 
causal cohesion, density of connectives, lexical diversity, tem- 
poral cohesion, spatial cohesion, and LSA. Coh-Metrix also 
calculates syntactic complexity and provides psycholinguistic 
data about words (parts-of-speech, frequency, concreteness, 
imagability, meaningfulness, familiarity, polysemy, and hyper- 
nymy). 

Essays submitted to W-Pal initially received a holistic rating 
from poor to great (6-point scale). Writers also received feedback 
that addressed particular writing goals and strategy-based solutions 
(see Figure 5). Such feedback was implemented as a series of 
scaffolded, threshold-based algorithms based on different linguis- 
tic properties and categories: legitimacy (e.g., proportion of non- 
words), length (e.g., number of words), relevance (e.g., occurrence 
of key words), and structure (e.g., number of paragraphs). For 
example, writers whose essays lacked elaboration (i.e., short es- 
says) might receive feedback such as, “One way to expand your 
essay is to add additional relevant examples and evidence,” and 
prompts such as, “Have you created a flow chart or writing road 
map to help you organize your ideas?” The feedback also directed 
students toward relevant lessons or practice games. Importantly, 
feedback scaffolding helped to deliver only the most appropriate 
help; feedback was delivered only for the lowest threshold failed in 
the series of checks. We assumed that students who struggled to 
produce any text may not be ready to implement feedback about 
cohesion. Instead, these students may gain more from planning. If 
essays passed basic thresholds, they received feedback encouraging 
overall revision. Depending on the quality of individual sections, 
essays also received formative feedback for introduction, 


Figure 4. Essay Writing Interface. 


body, and/or conclusion building strategies. For instance, an essay 
lacking a clear body might receive feedback stating “good writers 
often review their writing flowchart or an outline. Think about the 
best order and organization of the body paragraphs,” and asking, 
“Could a stranger understand your ideas without further explana- 
tion?” (Figure 5). 
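
The scaffolded, threshold-based selection described above can be sketched in Python as an ordered series of checks in which only the first failed check produces feedback. The feature names, threshold values, and all feedback strings other than the quoted elaboration message are hypothetical placeholders, since the actual W-Pal thresholds are not reported here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Check:
    name: str
    passes: Callable[[Dict[str, float]], bool]
    feedback: str


# Ordered from most basic to least basic. All thresholds are assumed values.
CHECKS = (
    Check("legitimacy",
          lambda f: f["nonword_proportion"] <= 0.20,
          "Placeholder: many words were not recognized."),
    Check("length",
          lambda f: f["word_count"] >= 200,
          "One way to expand your essay is to add additional "
          "relevant examples and evidence."),
    Check("relevance",
          lambda f: f["key_word_hits"] >= 3,
          "Placeholder: connect your ideas to the prompt."),
    Check("structure",
          lambda f: f["paragraph_count"] >= 3,
          "Placeholder: organize your essay into paragraphs."),
)


def first_failed_check(features: Dict[str, float]) -> Optional[Check]:
    """Return the first check the essay fails, or None if all pass."""
    for check in CHECKS:
        if not check.passes(features):
            return check
    # Passing every basic threshold would lead to overall revision
    # feedback and section-level feedback instead.
    return None
```

Stopping at the first failure mirrors the stated rationale: a student who struggles to produce any text receives help with elaboration or planning rather than feedback on later concerns.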

Unlike previous AWE systems, W-Pal focuses on strategy in- 
struction and formative feedback and provides no specific error 
feedback on style, mechanics, spelling, or grammar. Spelling and 
grammar errors are relatively easy to detect, but assessing the 
quality and relevance of thesis statements, topic sentences, exam- 
ples, counterarguments, and many other essay elements is more 
difficult. In the case of thesis statements, for example, it is a 
nontrivial matter to determine which sentence writers intended to 
communicate their position, if any. Once this determination is 
made, one must assess how the thesis relates to the prompt, 
subsequent arguments, and argument structure. At this stage of 
W-Pal development, we focused on the broader categories, which 
necessarily limited the specificity of W-Pal feedback. 

In sum, development of W-Pal has sought to satisfy four central 
design principles that emerge from the ill-defined nature of writing, 
a combination that has not been demonstrated in previous technologies for 
writing instruction. A fundamental question for deployment was 
whether an intelligent tutor for writing could be feasibly imple- 
mented with our target population of high school students. Would 
students use the system? Would students perceive a “computer 
tutor” as a viable instructional resource? To address these ques- 
tions, we conducted a feasibility study in five high school English 
classrooms throughout a school year. Because our primary purpose 
was to assess feasibility, we did not employ a controlled experi- 
mental design (i.e., comparison to non-W-Pal instruction) or ab- 
lative design (i.e., selective removal of system features). Thus, 


strong conclusions about the efficacy of W-Pal cannot be drawn 
from this study. 


Method 


Participants 


The intended users of Writing Pal are English-speaking high 
school students. Two high school English teachers and 141 10th 
grade students participated in this study over 6 months (November, 
2010 to May, 2011) in their English classrooms. Teachers were 
asked to use the entire W-Pal, including Writing Strategy Modules, 
practice games, and essays. However, they were not given strict 
rules for how W-Pal was to be integrated (e.g., module order, 
assignment pacing and duration, or curriculum integration). Teach- 
ers and students (via their teachers) could contact the W-Pal team 
for technical support and teachers had weekly conference calls 
with the researchers. The participating high school was located in 
the Washington, DC area, and enrolled over 2,400 students. The 
school enrolled 49.0% female students, with 22.3% Asian, 4.2% 
Black, 9.0% Hispanic, and 59.9% White students; 7.0% of students 
were described as limited English proficiency, and 10.9% qualified 
for free or reduced-price meals. 


Measures 


Data logging. As students interacted with W-Pal, their access 
of system tools was logged. To examine usage of W-Pal, we 
considered access and completion of the lesson videos, frequency 
of games played, and frequency of essay submissions. 

Lesson perception survey. After viewing each lesson, a five- 
item survey appeared. Using 4-point scales, students rated “how 







Figure 5. Example essay feedback report. WPAL = Writing Pal. 


many new ideas” they learned (i.e., 0, 1-2, 3-4, or 5 or more 
ideas) and whether they would be willing to view the lesson again. In 
open-ended items, students were asked to describe the “most 
helpful information” they learned, describe their perceptions of the 
animated characters, and provide suggestions for “how to improve 
this lesson.” 

Game perception survey. After interacting with W-Pal for 
several months (i.e., in February), students were asked to complete 
a four-item feedback survey of their perceptions of the games. 
Using 4-point scales, students rated a sampling of 11 games 
regarding helpfulness for practicing writing strategies, and rated 
the games regarding enjoyment. In two open-ended items, students 


were asked to provide suggestions for improving the helpfulness of 
the games and redesigning the games to be more enjoyable and 
engaging. 

Feedback perception survey. In addition to the Game Per- 
ception Survey, students completed an eight-item survey of their 
perceptions of the essay writing tools and feedback. Using 4-point 
scales, students rated the overall difficulty of using the essay 
writing interface, the difficulty of specific tools, feedback quantity, 
understandability of the feedback, and usability of the feedback. In 
two open-ended items, students were asked to offer suggestions for 
making the feedback “more clear, more understandable, or more 
usable” and to suggest what “essay features or writing strategies” 
should be included in future feedback. 

Pre- and post-study essays. Students wrote timed (25 min), 
prompt-based essays on two SAT-style prompts regarding 
“competition” and the influence of “images and impressions.” 
These essays were written offline (i.e., not within W-Pal), 
manually transcribed by the research team, and scored via 
natural language algorithms powered by Coh-Metrix (Crossley, 
Roscoe, Graesser, & McNamara, 2011). The accuracy of this 
algorithm, based on a separate test set of 105 essays and expert 
human scores, was 39% perfect agreement and 92% adjacent 
agreement. Descriptive information was also calculated for 
each essay, including the number of words, sentences, para- 
graphs, and sentences per paragraph. Text cohesion was as- 
sessed in terms of argument overlap (i.e., average overlap 
between head nouns and pronouns in adjacent sentences), 
given/new information (i.e., a Latent Semantic Analysis score 
indicating the amount of given compared to new information), 
and lexical diversity (i.e., degree to which a variety of words 
versus the same words are used across the text, using the 
measure D; Malvern, Richards, Chipere, & Duran, 2004). Prior 
research has indicated that higher quality essays are associated 
with a decrease in cohesion and an increase in lexical diversity 
(Crossley & McNamara, 2011). We also examined measures of 
lexical sophistication typically associated with essay quality 
(e.g., Crossley, Weston, McLain Sullivan, & McNamara, 2011), 
including word concreteness, word hypernymy (i.e., specific- 
ity), and the number of hedging words (i.e., an indicator of 
uncertainty). 
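
For readers unfamiliar with the two accuracy figures reported above, the following Python sketch shows one common way to compute them from paired human and automated scores. It assumes that perfect agreement means identical scores and that adjacent agreement means scores within one point of each other (so that it includes perfect matches); the function name agreement_rates and the example scores are hypothetical.

```python
def agreement_rates(human_scores, machine_scores):
    """Return (perfect, adjacent) agreement as proportions.

    perfect:  the two scores are identical.
    adjacent: the two scores differ by at most one scale point
              (assumed here to include perfect matches).
    """
    if len(human_scores) != len(machine_scores) or not human_scores:
        raise ValueError("score lists must be non-empty and equal in length")

    pairs = list(zip(human_scores, machine_scores))
    n = len(pairs)
    perfect = sum(1 for h, m in pairs if h == m) / n
    adjacent = sum(1 for h, m in pairs if abs(h - m) <= 1) / n
    return perfect, adjacent


# Made-up scores on a 6-point scale, for illustration only.
human = [1, 2, 3, 4, 5, 6]
machine = [1, 3, 3, 6, 4, 6]
perfect, adjacent = agreement_rates(human, machine)
print(f"perfect = {perfect:.2f}, adjacent = {adjacent:.2f}")
```

Under this definition adjacent agreement can never be lower than perfect agreement, which is consistent with the pattern of the two values reported for the scoring algorithm.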


Procedures 


Students wrote a pre-study essay in November. Throughout 
the school year, teachers incorporated W-Pal into their English 
classroom curriculum. Students viewed the lessons, played the 
games, wrote practice essays, wrote essays assigned by teachers, 
and completed the surveys. Essays assigned by the teachers 

Figure 6. Total frequency of lesson viewing across a 6-month time 
period. 


were often explicitly linked to reading assignments, such as Moliere’s 
Tartuffe. Students wrote a post-study essay in June. As 
this was an ecological setting, some students did not complete 
all assignments. 


Table 2 


Results 


Students’ Use of the System 


Students interacted with W-Pal for about 16 total hours, on 
average, but students’ use of W-Pal was unevenly distributed by 
module and across time. Figure 6 shows the distribution of strategy 
lessons accessed over the 6 months of the study (substantive 
activities are labeled with an abbreviation of the module name). 
Access was defined as a student interacting with at least one 
complete segment of the lesson. One pattern is that teachers mainly 
followed the sequence of prewriting, drafting, and revising. That 
is, they assigned the modules linearly in the “order” they were 
listed in the W-Pal interface. Teacher interviews indicated that 
they discouraged exploration; they preferred students to focus on 
current assignments and not to “get ahead.” Second, most use of 
W-Pal lessons occurred during the first 3 months and then became 
more sporadic. January was particularly active as teachers encour- 
aged students to complete the prewriting and drafting modules in 
preparation for SAT practice tests. Teachers did not assign lessons 
during February and April. Teacher interviews indicated that these 
months were devoted to separate writing assignments (e.g., a 
“how-to” paper), literature instruction (e.g., Tale of Two Cities and 
Things Fall Apart), and preparation for state exams. 

Over time, lesson activity appeared to decrease. This pattern is 
substantiated by the average completion percentage of each mod- 
ule (Table 2). In general, students seemed more likely to complete 
the earlier modules (e.g., Freewriting), but tapered off in the later 
modules (e.g., Revising). One explanation may be student fatigue. 
After 5 months of using W-Pal, any novelty had likely diminished. 
In addition, teachers’ focus on literature assignments and test 


preparation may have led to a decreased emphasis of W-Pal in the 
classroom. 


Table 2 
Average Completion Percentage for Lesson Videos, Frequency of Game Play, and Maximum Number of Game Plays by Module 

Module and game | Lesson completion M | Lesson completion SD | Game play M | Game play SD | Game play Maximum 

Prologue | 86.0 | [illegible] | | | 
Freewriting | 90.2 | 28.7 | | | 
  Freewrite Feud | | | 0.59 | 0.89 | 4 
  Freewrite Fill-In | | | 0.59 | 0.85 | 3 
Planning | 83.4 | 35.9 | | | 
  Mastermind Outline | | | 0.72 | 0.98 | 6 
  Planning Pump | | | 0.76 | 0.82 | 4 
Introduction Building | 82.9 | 31.7 | | | 
  Dungeon Escape | | | 0.86 | 0.88 | 4 
  Essay Launcher | | | 0.45 | 0.60 | 4 
  Fix-It | | | 0.49 | 0.61 | 3 
Body Building | 82.9 | 37.2 | | | 
  RAM-5 | | | 0.15 | 0.36 | 1 
  Fix-It | | | 0.29 | 0.45 | 1 
Conclusion Building | 73.1 | 44.1 | | | 
  Dungeon Escape | | | 0.53 | 0.77 | 4 
  Fix-It | | | 0.36 | 0.51 | 2 
Paraphrasing | 78.2 | 40.7 | | | 
  Adventurer’s Loot | | | 0.44 | 0.51 | 2 
  Map Conquest | | | 0.53 | 0.68 | 5 
Cohesion Building | 54.3 | 49.3 | | | 
  CON-Artist | | | 0.31 | 0.56 | 4 
  Undefined & Mined | | | 0.54 | 1.14 | 6 
Revising | 68.5 | 45.1 | | | 
  Speech Writer | | | 0.29 | 0.54 | 3 

Figure 7 provides a similar visualization of students’ game 
playing across modules, with substantive activity labeled by mod- 
ule. Few games were played in November, as most students had 
not unlocked any games. However, more games were played in the 
following months once teachers assigned the planning and drafting 
modules. Interestingly, game play continued during February 
when no new modules were assigned. Interviews revealed that 
teachers encouraged students to use the games as further practice 
during this time. In the final months, however, students mainly 
accessed the games associated with assigned modules. Table 2 
shows the mean frequency of playing each game. Games encountered 
earlier in instruction (e.g., Mastermind Outline) were played 
slightly more often than later games (e.g., Speech Writer). How- 
ever, there was variation in game play and some games from later 
modules were played as often as earlier games. The overall low 
frequency of play is likely a result of teachers’ discouragement of 
exploration. The wide variety of games offered by W-Pal may 
have also contributed. With many games to choose from, the desire 
to “master” any one game might have been low. 

Use of the essay writing tools was somewhat sparse because 
teachers used W-Pal for specific assignments rather than self- 
selected practice. Teachers assigned two to three W-Pal practice 
essays with automated feedback (on “Honesty,” “Uniformity,” or 
“Heroes”) in December and January. Students were not required to 
revise these essays and course grades were based only on assign- 
ment completion. In April and May, teachers assigned students to 
write on the “Memories” prompt in relation to the novel Things 
Fall Apart (with automated feedback). Revising of this essay 
occurred outside of W-Pal via extensive peer reviewing. Teachers 
initially reported confusion about how teacher-created prompts 
differed from built-in W-Pal writing prompts—essays written on 
teacher-created prompts could not be assessed by the algorithm in 
this version of the system. However, after discussion about this 
functionality, teachers still chose to create two new assignments in 
W-Pal. In one, students wrote about interpersonal perceptions 
in relation to the play Tartuffe (January); in the other, students 
responded to a newspaper article about the value of study halls in 
high schools (February). 

Figure 7. Total frequency of game playing across a 6-month time period. 

Interviews revealed that teachers perceived W-Pal’s essay tools 
favorably and felt that the system allowed them to assign more
writing than was feasible without W-Pal. Specifically, W-Pal provided
an accessible means for students to practice writing, with
automated feedback, and teachers could access these essays and 
feedback online. W-Pal also provided several ready-made writing 
prompts for assignments. However, the system could not support 
the full range of writing assignments that were required in the 
curriculum, a common problem for AWE systems (e.g., Grimes & 
Warschauer, 2008). Different writing genres (e.g., journalism and 
narrative) possess unique constraints that cannot be assessed by the 
same algorithm; computational linguistics models must be tailored to
each type. Most systems, including W-Pal, have focused upon 
persuasive writing due to its importance for standardized testing. 
Other genres are not currently supported but are a target for future 
development. Teachers also understood that W-Pal was still “in 
development” and thus were somewhat wary of basing students’ 
grades on W-Pal assessments. This concern may also have limited 
the number of practice essays teachers assigned. Teachers may 
have been hesitant to utilize W-Pal for writing practice unless they 
could also review or grade the assignments independently. Teach- 
ers understood the scoring and feedback procedures but, as con- 
scientious instructors, they wanted to remain actively aware of and 
involved with their students’ work and progress. 

In sum, students used a variety of W-Pal features but did so 
unevenly over the year. W-Pal deployment was not a smooth and 
continuous process; as with any educational resource, teachers 
were selective and opportunistic about how and when to use the 
system. Results also suggest that engagement with the system 
declined over time. We next consider students’ perceptions of 
W-Pal and how such perceptions may have impacted system use 
and feasibility. 


Lesson Perceptions 


Figure 8 (left side) presents the percentage of students as a 
function of the number of ideas they reported having learned from 
the lessons. In general, students reported the lessons to be helpful 
and informative. On average and across lessons, over half of the 
students (55.8%) reported learning three or more ideas per lesson. 
Within the open-ended questions asking students to summarize the 
most helpful idea learned from the lessons, the mnemonic devices
were the most frequent response. Thus, students seemed to value 
and remember the acronyms such as TAG, RECAP, and ARMS 
designed to cue recall of specific strategies. In contrast, students 
disliked the presentation of the lessons (Figure 8, right side). On 
average and across lessons, many students viewed the characters as 
awkward (62.3%) and boring (60.6%), but still informative 
(30.4%). 

In open-ended responses (see Table 3), students critiqued agent
dialog and requested succinct instruction with more competent and
less “cartoonish” characters. The computerized voices were also
unpopular, in part because of a text-to-speech glitch that sometimes
caused overlapping speech. Both students and teachers requested
that the lessons be shorter and faster, while retaining all of
the information. These concerns are summarized by one student,
who commented, “the little jokes between characters aren’t amusing,
especially in the monotone computer voices. Cutting out all
the unnecessary dialogue between characters would shave off a
good amount of time.” Altogether, these results support the earlier
hypothesis that lesson use decreased due to fatigue. As students
progressed through the lessons, and encountered the same design
issues, students’ willingness to engage with the lessons likely
decreased.

Figure 8. Student perceptions of learning and animated characters in Writing Pal lessons.


Game Perceptions 


Across sampled games, students (n = 116) reported the
games to be somewhat helpful (50.5%) or very helpful (29.6%)
for practicing the writing strategies (Figure 9, left). Similarly,
students reported the games to be somewhat enjoyable (46.4%)
or very enjoyable (19.1%) to play (Figure 9, right). Thus, most
students felt that the games they played were beneficial and
generally engaging. Open-ended comments (see Table 3) highlighted
ways in which students felt the games could be improved.
For example, one student requested that we make the
games “more challenging [because even] if I hadn’t taken the
W-Pal lessons, I would have been able to complete the challenges
with fairly high scores.” Other students expressed interest
in further generative practice, such as “when we learn the
strategies, I think should be a challenge where we actually use
the strategy instead of finding them in essays.” Another student
suggested that “the games could be more difficult and more
interactive for learning these writing strategies rather than just
reading.” In sum, students valued the games, but positive perceptions
may have been impacted by games that lacked challenge,
opportunities for interaction, or clear directions.


Table 3
Student Responses and Recommendations Regarding Strategy Lessons and Practice Games

Observation
    Examples

1. Students valued the strategies and mnemonics.
    “FAST PACE is going to help me write better essays! I learned important
    acronyms, and information. I learned to think about the prompt, add
    questions, think about the opposing side”
    “The TAG mnemonic and the attention grabbing techniques were very helpful
    for making me understand introductions better”
    “RECAP: restate, explain ideas, closing, avoid new things, present
    interestingly”

2. Students disliked the length and presentation style of the lessons.
    “Their voices are very robotic and the lesson was way too long, maybe if it
    was split into several sections then it would be easier to concentrate on the
    task”
    “Had very good information but I disliked the synthesized voices”
    “The information is good but I lost interest throughout the lesson. I feel like I
    would learn a lot more if the information went faster and was
    straightforward”

3. Students desired games that were more difficult and interactive.
    “When we learned the strategies, I think there should be a challenge where we
    actually use the strategy instead of finding them in essays”
    “Make the challenges more challenging. Even if I hadn’t taken the W-Pal
    lessons, I would have been able to complete the challenges with fairly high
    scores”

4. Some students found the game instructions inadequate.
    “Some of the instructions were hard to follow”
    “I had a little trouble understand exactly what to do with the directions.”

5. Students suggested improvements in the game graphics and sound.
    “The games could have better graphics and music to make the games more
    enjoyable”
    “The games are slow and the graphics are not the best, so unfortunately, the
    games become boring which weakens their effectiveness”
    “Many of the games were not very fun because they had a learning element
    that was very obvious. It would be better if the element was not as obvious,
    so the game was more fun. Basically, more pictures and music and less
    words”

6. Students requested that more game elements be added.
    “Make it a point system and make it a competition amongst our peers”


Figure 9. Student perceptions of the helpfulness and enjoyment of games.


Essay Writing and Feedback Perceptions 


Overall, students (n = 103) rated the essay writing tools as 
easy or very easy (81.5%) to use (Figure 10, top left). However, 
two features frustrated some students: 23.7% of students re- 
ported that reading the feedback was somewhat or very difficult, 
and 24.6% felt that revising their essays was somewhat or very 
difficult. This may have been due to feedback quantity or clarity 
(Figure 10, top right). Although about half of the students reported that
they received just the right amount of feedback (49.5%), others
reported that they received not enough (38.8%) or too much 
(11.6%). From internal testing (Roscoe, Varner, Cai, Weston,
Crossley, & McNamara, 2011), we knew that feedback quantity 
could be variable. Essays that failed a basic check (e.g., length) 
received only one feedback message. However, essays that 
advanced further could receive more messages on multiple 
topics. These extremes may have led to perceptions of insuffi- 
cient or overwhelming feedback, respectively. Similarly, as 
shown in Figure 10 (bottom left), most students rated the 
feedback as understandable (61.2%), but some students rated 
the feedback as somewhat confusing (29.1%) or very confusing 
(9.7%). Despite these challenges, students rated the feedback
as useful (Figure 10, bottom right) occasionally (45.6%) or 
often (33.0%). 

Students’ open-ended responses (see Table 4) further highlighted
student concerns. Specificity was a particular critique;
students commented that “the constructive criticism could be a bit
more detailed on what the writer needs to work on instead of an
overview” or should “be specific and give us exact examples on
what we should do to improve our writing.” Other students requested
that the feedback system provide information on both the
strengths and weaknesses in an essay, e.g., “the automatic feedback
could also give you good points on your essay, what was
strong and what you should continue do.” Thus, the provision of
feedback at the level of broad categories (e.g., body building
strategies) rather than specific essay elements (e.g., evidence quality)
was helpful but inadequate for some students. Overall, the
feedback provided by W-Pal in this study was perceived as beneficial
and relevant to students’ needs, but the content of the
feedback should be expanded to address more detailed issues.

Figure 10. Student perceptions of ease of use, quantity, understandability, and usefulness of automated essay feedback.


Table 4
Student Responses and Recommendations Regarding Essay Scoring and Feedback

Observation
    Examples

7. Students requested more specific feedback.
    “The feedback should show what specific things made me get the grade”
    “The feedback needs to be more helpful for us on our own personal
    essay. Not just general feedback. I don’t know what I did wrong in
    my essay when you just give a general understanding of it”

8. Students requested more individualized feedback.
    “My introduction and my supports. I still have a hard time finding
    supports that directly answer the question”
    “I would like to know in the feedback if my examples were not strong
    enough, if I had a weak thesis, things like that”

9. Students expressed conflicting concerns about the quantity of feedback.
    “Use less feedback and cut straight to the point of what the essay needs
    and give examples”
    “The feedback is very brief. W-Pal never really tells you what you need
    to improve on.”

10. Some students expressed skepticism at the speed and accuracy of
scoring.
    “I do not like how the essay is graded in less than a second! I feel my
    essay is not being graded properly and I don’t feel I have been given
    accurate feedback”
    “You cannot grade an essay in 5 seconds! Everybody gets the same
    grading of “fair.” I can’t use it if I don’t believe that it is true.”


Essay Quality


The natural language algorithm analyses of pre-study and post-study
essays (n = 113) are provided in Table 5. Essay scores
increased significantly from a mean of 2.3 (SD = 0.8) prior to the
study to a mean of 2.9 (SD = 0.8) after the study, t(112) = 5.85,
p < .001, d = 0.71. Associated with these gains were positive
changes in essay structure and lexical sophistication (see Table 5).
Post-study essays were longer, containing more words and sentences.
Essays also showed a clearer paragraph structure, with
more paragraphs overall and somewhat fewer sentences per paragraph
(e.g., fewer students wrote one-paragraph essays). Post-study
essays improved in vocabulary use, including more concrete
wording, more precise wording (word hypernymy), fewer hedging
words (e.g., maybe or might), and greater diversity. Finally, essays
showed more developed and elaborated content with less repetition
of themes (less overlap of arguments and given information) and
wording (increased lexical diversity). 
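
The reported effect size can be checked against the essay-score means and standard deviations in Table 5. The sketch below is only an illustration: the article does not say which formula was used for d, so the function (a hypothetical helper, not part of W-Pal) assumes the standardized mean difference with the root-mean-square of the pre- and post-study standard deviations as the denominator.

```python
import math


def cohens_d_pooled(mean_pre, sd_pre, mean_post, sd_post):
    """Standardized mean difference; the denominator is the root-mean-square
    of the two SDs (an assumed formula, not one stated in the article)."""
    pooled_sd = math.sqrt((sd_pre ** 2 + sd_post ** 2) / 2)
    return (mean_post - mean_pre) / pooled_sd


# Essay-score values taken from Table 5: pre 2.30 (0.84), post 2.88 (0.79).
d = cohens_d_pooled(2.30, 0.84, 2.88, 0.79)
print(round(d, 2))
```

Under this assumption the result rounds to the reported d = 0.71. An alternative definition for paired designs, t divided by the square root of n, would give a smaller value, so the pooled-SD reading appears to be the one that matches the text.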

Table 5
Essay Characteristics for Pre- and Post-Study Timed Essays

                                     M (SD)
Measure                     Pre               Post              t(112)        p
Essay score                 2.30 (0.84)       2.88 (0.79)        5.85         <.001
Length
  Number of words           260.81 (76.38)    308.27 (84.49)     6.49         <.001
  Number of sentences       15.46 (5.10)      [illegible]        5.66         <.001
Structure
  Number of paragraphs      3.43 (1.32)       3.97 (0.83)        3.87         <.001
  Sentences per paragraph   5.33 (3.02)       4.72 (1.44)        [illegible]  .071
Cohesion^a
  Argument overlap          0.51 (0.17)       0.41 (0.14)       -5.04         <.001
  Given/new information     0.32 (0.04)       0.30 (0.04)       -4.43         <.001
  Lexical diversity         [illegible]       98.29 (21.56)      [illegible]  <.001
Lexical sophistication
  Word concreteness         387.00 (32.29)    405.53 (30.78)     3.87         <.001
  Word hypernymy            1.57 (0.23)       1.66 (0.19)        4.23         <.001
  Hedging words             14.2 (10.6)       9.9 (7.6)         -4.10         <.001

^a These cohesion indices indicate the extent to which arguments, ideas, and words are repeated across sentences
and throughout the text.


Given the patterns of W-Pal use throughout the feasibility study,
it would be unlikely to observe strong effects of using the system
on essay gains. W-Pal was only one component of a broader
curriculum. Nonetheless, to assess how and whether use of W-Pal
might have influenced writing proficiency, an exploratory linear
regression analysis was conducted to identify potential predictors
of post-study essay quality. Eight predictor variables were simultaneously
entered. As measures of students’ prior writing ability
and knowledge, pre-study essay scores and self-reported grade-point
average (GPA) were included. As indicators of system use,
we included students’ percentage completion of prewriting lessons
(Freewriting and Planning), drafting lessons (Introduction Building,
Body Building, and Conclusion Building), and revising lessons
(Paraphrasing, Cohesion Building, and overall Revising).
Similarly, we included the frequency of game play within each
phase: prewriting games (Freewrite Feud, Freewrite Fill-In, Mastermind
Outline, and Planning Pump), drafting games (Essay
Launcher, Dungeon Escape, Fix It, and RAM-5), and revising
games (Adventurer’s Loot, Map Conquest, Undefined & Mined,
CON-Artist, and Speech Writer). Because teachers chose to restrict
essay writing practice, there was little variability in essay
writing, and this variable was not included.

The resulting linear regression model was significant, F(112) =
2.93, p = .005, R² = .18, accounting for about one fifth of the
variance in post-study essay scores (see Table 6). Two variables 
were predictive of essay quality: pre-study essay scores and view- 
ing of the drafting lessons. Interestingly, students’ prior writing 
ability (pre-study essay score), but not their GPA, was a significant 
predictor of post-study essay quality. These results suggest that 
writing skill was not solely a function of students’ prior academic 
abilities, but reflected knowledge of specialized skills and strate- 
gies related to writing. Students’ completion of the drafting lessons 
was positively associated with their writing development above 
and beyond prior writing ability. Drafting lessons are perhaps the 
most immediately relevant to students’ writing of timed essays, 
because they provide direct strategies for generating essay text. 
Overall, although we cannot conclude that W-Pal directly im- 
proved students’ writing, these results tentatively support the fea- 
sibility of intelligent tutoring of writing in high school classrooms. 
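
As a rough consistency check on the regression summary, the overall F statistic can be recovered from R² and the degrees of freedom. The sketch below rests on assumptions not stated in the text: that all 113 students with essay data entered the model, and that the eight predictors plus an intercept leave 104 residual degrees of freedom. The function name is hypothetical.

```python
def f_from_r_squared(r_squared, n_obs, n_predictors):
    """Overall F test for a linear regression with an intercept.

    F = (R^2 / k) / ((1 - R^2) / (n - k - 1)), where k is the number of
    predictors and n the number of observations.
    """
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    f_stat = (r_squared / df_model) / ((1 - r_squared) / df_resid)
    return f_stat, df_model, df_resid


# Assumed inputs: R^2 = .18 as reported, n = 113 essays, 8 predictors.
f_stat, df1, df2 = f_from_r_squared(0.18, 113, 8)
print(df1, df2, round(f_stat, 2))
```

With R² entered as the rounded value .18, the computed F is a little below the reported 2.93. A gap of that size is what rounding R² to two decimals would produce, so the reported statistics appear mutually consistent under these assumptions.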


Table 6
Linear Regression Analysis to Predict Post-Study Essay Scores

Variable                 r             B        β             SE     t       p
Pre-study essay score    .31           0.273    .293          .088   3.09    .003
GPA                      [illegible]  -0.031   -.026          .118  -0.26    .794
Prewriting lessons       .05          -0.002   -.131          .002  -1.02    .308
Drafting lessons         .17           0.004    .431          .001   2.81    .006
Revising lessons         .08          -0.001   -.157          .001  -1.20    .233
Prewriting games         [illegible]  -0.015   -.053          .032  -0.47    .640
Drafting games           .00          -0.054   -.191          .035  -1.54    .126
Revising games           [illegible]   0.058    [illegible]   .037   1.59    .115

Note. GPA = grade-point average. Estimated constant term is 2.32. Boldface font indicates statistically significant predictors.


Discussion


The unique design of W-Pal was informed by the ill-defined
nature of writing, in which there is significant ambiguity and
subjectivity with respect to pedagogy and assessment. We have
sought to provide comprehensive and modular strategy instruction,
diverse opportunities for extended practice, and formative feedback
on students’ writing. In this study, we evaluated how W-Pal
was perceived by high school students in English classrooms. A
fundamental assumption was that feasibility depends on whether users view
the system as a valid and valuable tool for instruction and feedback.
Thus, students’ use and perceptions of W-Pal were the
central focus.

Our results suggest that this initial version of W-Pal was generally
well received. Most components of W-Pal were judged as
beneficial sources of writing instruction, practice, and feedback.
Students could describe specific content that they learned from the
lessons and games, and rated these tools and essay feedback as
helpful and easy to use. Students seemed to view a “computer
tutor” as a worthwhile addition to the English classroom curriculum.
Preliminary evidence also suggests that students benefitted
from using certain W-Pal tools. Thus, the initial iteration of W-Pal
was feasible with regard to positive user perceptions and usage.

Our results also highlighted several problems to overcome that
may undermine long-term feasibility and potential efficacy. First,
students felt that the lessons were too long and didactic, and
disliked the cartoonish characters in the lessons. In some ways, the
lengthy lesson videos were too similar to a presentational mode of
writing instruction described by Hillocks (1984). Hillocks contrasted
writing outcomes for interventions that employed different
instructional modes and content. The most effective instruction
occurred in an environmental mode wherein instructors minimized
lecturing and focused on specific objectives and strategies, with
ample opportunities for scaffolded practice. In contrast, instruction
was less effective in the prescriptive and teacher-dominated
presentational mode. Although interactive checkpoints were included
in the lessons, students’ overall perceptions were that the lessons
were too long, boring, and lecture-like. This lesson structure may
also have insufficiently met the goal of providing modular instruction;
each lesson video comprised multiple strategies related to
multiple goals. A series of shorter lessons, each with a focus on
one or two related strategies, may have been more germane to
Hillocks’ environmental mode. Students could iterate between
lessons and practice more flexibly, and instructors could be more
selective with the content they wished to cover.

More broadly, the issue of information density within instructional
modules speaks to the appropriate grain-size of ITS instruction
in ill-defined domains. When learners must make many strategic
decisions to enact a task, instruction may need to focus
initially on fewer decisions before asking students to synthesize
them. With each additional, simultaneous strategy choice, it becomes
more difficult for learners to perceive the impact or utility
of each strategy. In problem-solving domains (e.g., physics),
research has shown benefits for systems that require students to
specify each step of their solution process rather than merely the 
final answer (Hausmann, VanLehn, Nokes, & Gershman, 2009;
VanLehn et al., 2005). This decomposition allows the system to 
assess and provide feedback for individual steps, and learners are 
encouraged to consider the impact of each decision. Analogously, 
in intelligent tutoring in ill-defined domains such as writing, it may 
be beneficial to teach fewer writing strategies at one time so that 
students can more gradually build up to the full complexity of the 
writing process. The modular content of the ITS should facilitate 
the decomposition of complex processes into manageable units for 
initial learning, which can be subsequently recombined and ap- 
plied strategically in later practice. 

A second critique expressed by students related to types and 
difficulty of learning tasks presented in the educational games. 
Surprisingly, some students expressed interest in more difficult 
games that required active generation of text. Such students 
wanted to practice by applying strategies to their own writing 
rather than inspecting examples written by others. Not surpris- 
ingly, we also observed a high degree of variability in students’ 
game preferences. Games that were played frequently or rated 
highly by some students were despised by others, and vice versa. 
Only a few games were broadly disliked; for instance, RAM-5 (a 
body building game in which students matched potential evidence 
to topic sentences) had little replay value, and the task was vague. 
A few games were liked by the majority of students. One example 
was Map Conquest, a Risk-like game in which students earn 
resources by identifying paraphrasing strategies and then use those 
resources to “conquer” a map controlled by computer opponents. 
An interesting facet of this game is that the learning task (identi- 
fying paraphrases) and the game task (taking over the map) are 
disjoint. Success in the learning task did not guarantee success in 
the game, and vice versa. This might have made the “gaming” 
aspects of the practice more salient for some students. 

The positive perception of educational games in W-Pal suggests
that such games may be a valuable component for intelligent tutoring in
ill-defined domains. Specifically, games may help to offset some 
of the motivational threats that undermine students’ engagement 
with ITSs and extended practice. Success in ill-defined domains 
requires learning of underspecified concepts and relations, and the 
ability to recharacterize problems to apply available strategies 
(Lynch et al., 2009). Developing such skills may be frustrating as 
students struggle to master many decisions and tasks. Indeed, 
students often report high apprehension and low confidence re- 
garding their writing abilities (e.g., Pajares, 2003). Our results hint 
that educational games may help to ameliorate some of the affec- 
tive challenges that arise with learning in ITSs and ill-defined 
domains (e.g., Craig, Graesser, Sullins, & Gholson, 2004). Games 
may provide a more pleasant setting where practice is embedded 
within an enjoyable experience, and feedback is framed within 
game mechanics or narrative rather than overt critique. However, 
based on these findings, developers who wish to bolster ITSs with 
educational games should ensure that the games offer sufficient 
challenge, promote generative activity, and exhibit varied game- 
play. 

A final concern revealed by the study, and perhaps the greatest 
challenge for future development, was the need for more specific 
and individualized feedback. Students expressed a clear desire to 
learn more about the individual strengths and weaknesses of their
essays, and a lack of such specificity undermined confidence in the
system for some students. However, improvements to W-Pal’s 
feedback engine will require sophisticated additions and refine- 
ments to underlying computational linguistics algorithms. Al- 
though the framing and content of the feedback is paramount— 
feedback must be well-constructed to provide actionable 
suggestions in a scaffolded and nonthreatening manner—the feed- 
back process is necessarily constrained to essay features that can 
be reliably detected. We are currently exploring alternative meth- 
ods for developing feedback algorithms. 

Issues of valid and formative feedback generalize beyond essays 
and writing. Algorithm development is likely to be a key obstacle 
in the growth of tutors for writing and other ill-defined domains 
(McNamara et al., in press). Any ITS that accepts open-ended or 
natural language input, and attempts to respond to learners with
intelligent guidance and help, may need to solve a similar set of 
problems. For example, an ITS that allows users to explain scien- 
tific concepts will require algorithms that can process and interpret 
users’ intended answers. Tutorial feedback, such as corrective 
hints or explanations, will be more valuable to the extent that users 
believe the system can target their individual strengths, weak- 
nesses, knowledge, and misconceptions. 


Conclusion 


W-Pal development and testing have revealed several issues and 
lessons for building an ITS in ill-defined domains. Some of these 
feasibility problems may be termed presentational, in that they can 
be overcome by redesigning the interface or mode of instruction to 
be more modular, engaging, succinct, game-like, and so on. These 
are relatively easy to fix—more recent iterations of W-Pal have 
already addressed a number of concerns—although they are often 
only revealed through extensive usability and feasibility testing. 
Other feasibility issues may be termed algorithmic and relate to the 
methods by which complex, open-ended, and ambiguous student 
inputs are processed and evaluated. New and innovative methods 
for assessing such inputs may be required to realize the full 
potential of intelligent tutoring in ill-defined domains. However, in 
ill-defined domains, a certain level of permanent ambiguity may 
have to be embraced, and the focus must be on guiding students 
toward progress and independence, rather than delivering, correct- 
ing, or testing a well-defined body of knowledge. 
Correction to Hernandez et al. (2013)


In the article “Sustaining optimal motivation: A longitudinal analysis of interventions to broaden 
participation of underrepresented students in STEM” by Paul R. Hernandez, P. Wesley Schultz, 
Mica Estrada, Anna Woodcock, and Randie C. Chance (Journal of Educational Psychology, Vol. 
105, No. 1, pp. 89-107 doi: 10.1037/a0029691), there was an error in the Appendix. The items 
listed below should have appeared without an asterisk. 


TGO-6. An important reason I do my school work is because I enjoy it. 


PAp-2. It’s important to me that the other students in my classes think that I am good at my work. 


PAp-3. I want to do better than other students in my classes. 


PAv-1. It’s very important to me that I don’t look stupid in my classes.


PAv-5. One reason I would not participate in class is to avoid looking stupid.
Learning Intercultural Communication Skills With Virtual Humans:
Feedback and Fidelity


H. Chad Lane, Matthew Jensen Hays, Mark G. Core, and Daniel Auerbach 
In the context of practicing intercultural communication skills, we investigated the role of fidelity in a
game-based, virtual learning environment as well as the role of feedback delivered by an intelligent 
tutoring system. In 2 experiments, we compared variations on the game interface, use of the tutoring 
system, and the form of the feedback. Our findings suggest that for learning basic intercultural 
communicative skills, a 3-dimensional (3-D) interface with animation and sound produced equivalent 
learning to a more static 2-D interface. However, learners took significantly longer to analyze and 
respond to the actions of animated virtual humans, suggesting a deeper engagement. We found large 
gains in learning across conditions. There was no differential effect with the tutor engaged, but it was 
found to have a positive impact on learner success in a transfer task. This difference was most 
pronounced when the feedback was delivered in a more general form versus a concrete style. 


Keywords: virtual humans, intelligent tutoring systems, sense of presence, feedback, intercultural
communication


Pedagogical agents are animated characters that inhabit virtual 
learning environments and usually play the role of tutor (Haake & 
Gulz, 2009; Johnson, Rickel, & Lester, 2000) or peer (Y. Kim & 
Baylor, 2006). In these roles, the agent typically works alongside 
the learner to provide guidance (Arroyo, Woolf, Royer, & Tai, 
2009), hold conversations (Graesser & McNamara, 2010), and 
encourage and motivate (Baylor, 2011), among many other forms 
of possible scaffolding. The role of pedagogical agents in virtual 
learning environments continues to expand. One use of pedagog- 
ical agents is replacing a human role player. Thus, instead of the 
agent assisting the learner with problems, it is the interaction itself 
with the agent that is intended to have educational value. Here, the 
agent is usually a virtual human playing a defined social role, with 
learners also playing a role and using specific communicative 
skills to achieve goals. For example, to prepare for an international 
business trip, a learner might meet with a virtual foreign business 
partner from the country of interest to negotiate a fictional contract 
agreement. 

The technology challenge is to simulate social encounters in 
realistic ways and in authentic contexts. The pedagogical challenge 
is to design scenarios in ways that achieve the learning goals,
maintain a high level of real-world fidelity, and stay within an 
ideal window of challenge (whatever that may be). The basic 
problems of doing this with virtual humans are eloquently stated 
by Gratch and Marsella (2005): 


These “virtual humans” must (more or less faithfully) exhibit the 
behaviors and characteristics of their role, they must (more or less 
directly) facilitate the desired learning, and current technology (more 
or less successfully) must support these demands. The design of these 
systems is essentially a compromise, with little theoretical or empir- 
ical guidance on the impact of these compromises on pedagogy. 
(p. 256) 


The natural tendency is to build simulations to maximize real- 
ism since authentic practice opportunities are essential both for 
learner motivation and transfer to real-world contexts (Sawyer, 
2006). However, some questions have been raised regarding the 
definition of realism as it applies to human communicative behaviors.
Human variability due to personality and cultural differences
suggests that virtual humans may have a small amount of flexibility
to adapt to learners’ needs while remaining realistic (Wray et al., 
2009). Further, the design of virtual human scenarios can have a 
profound influence on the efficacy of the resulting learning expe- 
riences and should be carefully constructed to exercise the targeted 
communicative skills (Ogan, Aleven, Jones, & Kim, 2011). 

In this article, we describe a game-based system for teaching 
intercultural communication skills and an associated intelligent 
tutoring system (ITS). We then present two studies investigating 
issues related to fidelity and feedback, both of which are important 
factors in virtual learning environments with virtual role players. 
The goal is to identify the influences of these factors on learner 
behaviors and on their acquisition of new communication skills. 
The article ends with a summary of the results, limitations of our 
studies, and a discussion of future research topics. 


The Acquisition of Intercultural Communication Skills 


Social skills (or equivalently, interpersonal skills) form the 
foundation for both simulation of communicative skills (using a 
virtual human) and for teaching communicative competence. Al- 
though no clear consensus has emerged on a single definition of 
social skills, most include the notions of choosing appropriate and 
effective communicative actions for a given context (Segrin & 
Givertz, 2003). Because of our specific focus on intercultural 
communication, we adopted the more precise definition of social 
skills as “the ability of an interactant to choose among available 
communicative behaviors in order that [she or] he may success- 
fully accomplish [her or] his own interpersonal goals during an 
encounter while maintaining the face and line of his fellow inter- 
actants” (Wiemann, 1977, p. 198). It is worth noting that what 
constitutes success in a social interaction is not always obvious, 
rational, or consistent. Further, how interpersonal and communi- 
cative goals are established may or may not be evident (Spitzberg 
& Cupach, 2002). 

Despite the peculiarities of human communication, the concept 
of social skills can be broken down in many different ways. One 
of the simplest is to consider two fundamental processes: message 
reception (Wyer & Adaval, 2003) and message production 
(Berger, 2009). Message reception refers to one’s ability to both 
interpret social signals of others (such as speech and nonverbal 
behaviors) and infer meaning from the communicative acts conveyed 
by those social signals. The receiver must have both (a) the 
motivation to interpret and process the message and (b) the knowledge 
necessary to comprehend it (Wyer & Adaval, 2003). 

Challenges to successful decoding of a message can come from 
contextual and pragmatic sources in the immediate environment, 
as well as from internal biases or beliefs. For example, assump- 
tions one makes on the basis of stereotypes can greatly impede 
message reception. On the message production side, similar chal- 
lenges arise. How one forms a message (consciously or not) 
depends again on context, beliefs, biases, and so on. Automated 
communicative skills are deeply rooted and, thus, difficult to 
modify in ways that enhance the odds of producing more effective 
outgoing messages. Nonetheless, the acquisition of novel commu- 
nicative skills has been shown to follow the same patterns as other 
cognitive skills (Greene, 2003), and so the same techniques used to 
promote learning should apply. For example, it is known that 
repeated practice opportunities with feedback are an essential 
component in the development of expertise (J. R. Anderson, Cor- 
bett, Koedinger, & Pelletier, 1995; Kluger & DeNisi, 2004; Shute, 
2008). We have applied these foundational principles in our work 
by providing a virtual practice environment for intercultural com- 
munication skills with automated feedback. 


Virtual Humans as Role Players 


Live role playing has a long history in education (Kane, 1964) 
and for teaching social interaction skills (Mendenhall et al., 2006; 
Segrin & Givertz, 2003). There are problems, however, with the 
approach. First, role playing in classrooms is not situated in a 
realistic context, which potentially limits transfer of the learned 
skills. Second, when peers act as role players, the attitudes, con- 
versational content, and so forth of the role play may not be 
authentic or realistic. Third, expert human role players are generally 
believed to be the best option but are not cost effective and can 
be prone to inconsistency and fatigue. Although virtual humans 
have significant limitations, they undoubtedly address some of 
these complex issues (Cassell, Sullivan, Prevost, & Churchill, 
2000; Lim, Dias, Aylett, & Paiva, 2012). 


Empirical Support for Learning With Virtual Humans 


Can virtual humans be effective role-players? Seminal work 
presented by Reeves and Nass (1996) in The Media Equation 
showed that people bring many of their usual assumptions about 
human-human interaction to computer-based interactions. Further, 
evidence is mounting that this result holds even more strongly 
when the computer presents a virtual agent (Gratch, Wang, Gerten, 
Fast, & Duffy, 2007; Pfeifer & Bickmore, 2011; Zanbaka, Ulinski, 
Goolkasian, & Hodges, 2007). In other words, people treat virtual 
humans as if they are real. Further, characters who provide per- 
sonalized interactions are known to increase feelings of social 
presence, which in turn enhance learning (Moreno & Mayer, 
2004). Learning can also be enhanced when learners choose to 
adopt social goals (e.g., “come to know your partner”) while 
interacting with virtual humans (Ogan, Kim, Aleven, & Jones, 
2009). Together, these results suggest that virtual humans can 
induce feelings of social presence in learners, that these feelings 
are enhanced through personalization and simulation of social and 
relational behaviors, and, ultimately, that we should expect a 
concomitant improvement in learning. 

Early studies of the efficacy of virtual-human-based systems to 
teach intercultural skills seem to support this conclusion. Signifi- 
cant gains in overall learning were found for Tactical Iraqi (Sur- 
face, Dierdorff, & Watson, 2007) as well as Bilateral Negotiation 
Trainer (BiLAT; Durlach, Wansbury, & Wilkinson, 2008; J. M. 
Kim et al., 2009; Lane, Hays, Auerbach, & Core, 2010). Unfor- 
tunately, these and similar studies of other virtual learning envi- 
ronments for culture do not compare the systems with traditional 
(e.g., classroom-based) intercultural training, so it is not yet known 
if they are more effective than classroom-based learning (Ogan & 
Lane, 2010). 

Virtual humans have been used successfully as role players in 
many contexts. For example, virtual agents have served as patients 
in clinical training (Johnsen, Raij, Stevens, Lind, & Lok, 2007), 
persons of interest in police officer training (Hubal, Frank, & 
Guinn, 2003), modelers of healthy play for children with autism 
(Tartaro & Cassell, 2008), victims and perpetrators of bullying in 
school settings (Aylett, Vala, Sequeira, & Paiva, 2007; Sapouna et 
al., 2010), and modelers of coping behaviors for mothers of chil- 
dren with serious illness (Marsella, Johnson, & LaBore, 2000). A 
key question for the intelligent virtual agent community is whether 
effectiveness will also increase with increased sophistication of the 
agents. 


BiLAT: Teaching Bilateral Negotiation 
With Cultural Awareness 


The context for our work is BiLAT, a game-based simulation for 
practicing the preparation, execution, and understanding of bilat- 
eral meetings in a cultural context (J. M. Kim et al., 2009). As part 
of an overarching narrative, learners prepare and meet with a series 
of virtual humans to solve problems in a fictional Iraqi city. Even 
though BiLAT’s overall scope is much broader, our focus is on 
face-to-face meetings between learners and virtual characters and 
the basic intercultural communicative skills necessary to build 
trust and reach agreements. BiLAT meetings emphasize both mes- 
sage production and reception skills as discussed earlier. 

In BiLAT, learners meet with one or more characters to achieve 
a set of predefined objectives. For example, the learner may need 
to convince a high-ranking local official to stop imposing certain 
taxes on his people or reach an agreement about who will provide 
security at a local marketplace. In all cases, the learner is required 
to adhere to Arab business cultural expectations and norms 
(Nydell, 2006), establish a relationship through building trust, and 
apply integrative negotiation techniques. Specifically, BiLAT is 
designed as a practice environment for learning win/win negotiation 
techniques, which promote the idea that negotiation counterparts 
should proactively strive to meet each other’s needs as well 
as their own (Fisher, Ury, & Patton, 1991). To achieve this in BiLAT, 
learners must also apply their understanding of the character’s 
culture to modify their own communicative choices (Landis, Ben- 
nett, & Bennett, 2004). BiLAT’s focus is on the communicative 
intent and the structure of meetings and does not seek to teach new 
languages. 

Screenshots of the BiLAT interface are shown in Figure 1. On 
the left is one of several navigation screens used in the game. On 
the right is the meeting screen, where learners spend much of their 
time during play. To communicate with the virtual character (i.e., 
apply message production skills), the learner selects from a menu 
of about 70 conversational actions that can vary between scenarios. 
For example, the learner can engage in small talk (e.g., “talk about 
soccer”), ask questions (e.g., “ask who is taxing the market” and 
“ask if he enjoys travel”), and state intentions (e.g., “say you are 
interested in finding a mutually beneficial agreement”), among 
other possibilities. Physical actions are also available (e.g., “re- 
move sunglasses” or “give medical supplies”). Corresponding 
dialogue text is displayed in a dialogue pane while the character 
responds with synthesized speech and animated gestures. 

BiLAT characters possess culturally specific models of how 
they expect meetings to progress. This progression includes ex- 
pectations for an opening phase, a social period, a business period, 
and a closing social period. These phases are derived from live role 
playing sessions and cognitive task analysis performed with 
subject-matter experts early in the development of BiLAT (J. M. 
Kim et al., 2009). An example of a knowledge component taught 
by BiLAT is to follow the lead of your host. If a learner chooses 
an action that is not appropriate for the current phase of a meeting, 
the character will respond negatively. Trust, which is directly 
affected by the ability of the learner to take appropriate and 
effective actions, is a major factor in whether BiLAT characters 
will be agreeable or difficult. When trust is not established, it is 
often impossible to achieve all necessary agreements because the 
character will not be as interested in working together. This means 
that learners often need multiple follow-up meetings with the same 
character to achieve objectives and to try different strategies for 
building trust. 


Intelligent Tutoring in BiLAT 


The intelligent tutoring system in BiLAT provides feedback to 
learners as they interact with characters. It is based on knowledge 
components that were identified during the initial cognitive task 
analysis and uses them to maintain a learner model and generate 
the content for feedback messages (Lane et al., 2008). Help can 
come in the form of feedback about a previous action (e.g., explain 
a reaction from the character by describing an underlying cultural 
difference) or as a hint about what action is appropriate at the 
given time. Both types of messages appear in the BiLAT dialogue 
pane (shown in Figure 1). Further, the system implements an 
adjustable model-scaffold-fade algorithm that reduces coaching 
support with increased time and successful interactions (Collins, 
Brown, & Newman, 1989). 

Assessment of learner actions is driven by a model of intercul- 
tural interactions for Arab business meetings. We defined a typing 
system for the lowest level elements in the knowledge component 
hierarchy to facilitate the ITS’ understanding of the different 
categories of message production. These include required steps, 
usual steps (but not required), rules of thumb, and avoids and are 
identified as tags on steps in recipes for achieving certain com- 
munication or negotiation goals. For example, the knowledge 
components include recipes for greeting, socializing, eliciting the 
perspective of the counterpart, asking about local events, and 
more. The assignment of tags to recipes was completed as a joint 
authoring effort between researchers and subject-matter experts. 

These scenario-independent recipes were then mapped into 
communicative actions available in the game. This allowed the ITS 
to track learner actions in terms of knowledge components and 
evaluate actions as positive or negative instances of understanding 
those components. This authoring task was also performed jointly 
between researchers and subject-matter experts. It was often nec- 
essary to assign two links to some actions that had both positive 
and negative elements. For example, if a learner promises to give 
a character what she or he wants, the relationship with that character 
may be enhanced, but the promise could lead to problems 
down the road (e.g., the character’s neighbors may grow jealous 
and demand the same favors). These trade-offs were highlighted 
by the ITS when they occurred: there were usually reasons to take 
the action (or respond) and reasons not to do it, and the best choice 
depended on the payoff and specific problem being solved. 


Figure 1. Screenshots of Bilateral Negotiation Trainer, a serious game for intercultural communication. 
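
The tagging of recipe steps and the linking of in-game actions to knowledge components can be pictured with a small data-structure sketch. This is an illustration only, not the BiLAT implementation: the class names, the component names, and the evidence-counting learner model are assumptions introduced here, while the four step tags and the idea of an action carrying both a positive and a negative link come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class StepTag(Enum):
    """The four step categories named in the text."""
    REQUIRED = "required"
    USUAL = "usual"
    RULE_OF_THUMB = "rule of thumb"
    AVOID = "avoid"


@dataclass(frozen=True)
class Link:
    """Connects one in-game action to one knowledge component."""
    component: str   # hypothetical knowledge component name
    tag: StepTag
    positive: bool   # True if choosing the action shows understanding


@dataclass
class LearnerModel:
    """Hypothetical learner model: (positive, negative) counts per component."""
    evidence: dict = field(default_factory=dict)

    def record(self, action, links_by_action):
        """Update the counts for every link attached to the chosen action."""
        links = links_by_action.get(action, [])
        for link in links:
            pos, neg = self.evidence.get(link.component, (0, 0))
            if link.positive:
                pos += 1
            else:
                neg += 1
            self.evidence[link.component] = (pos, neg)
        return links


# Hypothetical mapping; the second action carries two links, one positive
# and one negative, mirroring the promise example in the text.
LINKS = {
    "remove sunglasses": [
        Link("show respect when greeting", StepTag.USUAL, True),
    ],
    "promise requested favor": [
        Link("build the relationship", StepTag.RULE_OF_THUMB, True),
        Link("avoid commitments that create later problems", StepTag.AVOID, False),
    ],
}

model = LearnerModel()
touched = model.record("promise requested favor", LINKS)
# An action with both kinds of link is a trade-off the tutor could highlight.
has_tradeoff = any(l.positive for l in touched) and any(not l.positive for l in touched)
```

Keeping the links separate from the recipes reflects the scenario-independent design described above: the same components can be reused while each scenario supplies its own action mapping.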


Measuring Learning From BiLAT and the ITS 


In the experiments described, we used two measures to evaluate 
learning produced by BiLAT and the ITS. The first measure was 
a situational judgment test (SJT). In general, SJTs present several 
domain-relevant scenarios, each of which is accompanied by sev- 
eral actions that the learner might perform in response to the 
scenario (Legree & Psotka, 2006). The participants provided 
Likert-scale ratings for each action (0 = very poor action, 5 = 
mixed/OK action, 10 = very good action). There were eight total 
scenarios and 28 total actions in the SJT (these items were pro- 
vided by an external team at the U.S. Army Research Institute). 
The following is an example situation and to-be-rated actions: 


Major Cross and Hamad are wrapping up their meeting, right on 
schedule. There are only a few minutes left in the allotted time for the 
meeting. Before the meeting, Hamad explained that he would need to 
leave at a particular time so that he is able to get to the mosque in time 
for afternoon prayer. Rate the following ways in which Major Cross 
could end the meeting. 


(0-10) ___ Revisit any results of the meeting that were unsatisfactory 
and try to work them out. 


(0-10) ___ Make sure Hamad clearly understands all agreements. If 
the meeting runs a little longer than scheduled, it is okay. 


(0-10) ___ Spend some social time together and remind Hamad that 
his friendship is valuable. 


To score the participant responses, we used ratings provided by 
three subject-matter experts. Understanding of the domain knowledge 
is defined as the degree to which a participant’s responses correlate 
with the experts’ responses (Legree & Psotka, 2006). The test 
was administered in a pretest-posttest design, and so learning was 
defined as the increase in the correlation from pretest to posttest. 
Because the situational judgment test focused on the participants’ 
ability to recognize and understand concepts about intercultural 
interaction, it measured learning at the lower levels of Bloom’s 
taxonomy of cognitive skills (L. W. Anderson & Krathwohl, 
2001). 
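
The scoring rule just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text does not say how the three experts' ratings were combined or which correlation coefficient was used, so the item-wise mean and the Pearson coefficient below are assumptions, and the ratings are invented for the example.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def expert_key(expert_ratings):
    """Assumed aggregation: average the experts' ratings item by item."""
    return [mean(item) for item in zip(*expert_ratings)]


def sjt_gain(pre, post, key):
    """Learning as the increase in correlation with the expert key."""
    return pearson(post, key) - pearson(pre, key)


# Invented 0-10 ratings for six actions (illustrative only).
experts = [
    [2, 8, 9, 1, 5, 7],
    [3, 9, 8, 0, 5, 6],
    [1, 7, 10, 2, 5, 8],
]
key = expert_key(experts)
pretest = [5, 5, 6, 4, 5, 5]    # mostly undifferentiated ratings
posttest = [2, 8, 9, 1, 5, 7]   # ratings that track the experts
gain = sjt_gain(pretest, posttest, key)
```

A correlation-based score rewards a participant for ordering the actions as the experts do, even if the participant uses a narrower or shifted part of the rating scale.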

The second measure was an in-game posttest that focused on a 
new issue (supply thefts from an Iraqi hospital rather than the 
market). During the participants’ meetings with a hospital admin- 
istrator, no feedback was provided. For each action that a partic- 
ipant selected during these meetings, we examined the probability 
that it was inappropriate. Participants who made fewer errors were 
said to have learned more about intercultural interaction than were 
participants who made more errors. We also examined the prob- 
ability that the participant was able to successfully negotiate with 
the hospital administrator. Although it was a binary measure, 
success indicated that the participants were able to build up trust 
and consider their partner’s needs effectively. Because the in-game 
posttest measured the participants’ ability to apply what they 
learned about intercultural interaction, it measured learning at the 
middle levels of Bloom’s taxonomy of cognitive skills (L. W. 
Anderson & Krathwohl, 2001). 


Experiment 1: Fidelity and Presence 


Although not the only method, one approach to measuring 
engagement is by investigating to what degree a system can 
establish a sense of presence. One way to induce a sense of 
presence is to provide greater visual and auditory fidelity (e.g., 
more realistic graphics and sound; Lombard & Ditton, 1997). 
Intuitively, it seems that greater sensory fidelity should also pro- 
mote better training; this is a common point of emphasis in training 
system design requirements. However, recent studies on the effect 
of presence on training suggest that engagement and effective 
outcomes are enhanced by greater sensory fidelity, but learning 
does not necessarily improve (Rowe, Shores, Mott, & Lester, 
2011). An exception is the case in which a specific task domain 
requires high-sensory-fidelity simulation (e.g., a flight simulator), 
but most systems with greater sensory fidelity are not necessarily 
better trainers as a result. 

Experiment 1 was designed to determine whether a social simulator 
must have high visual and auditory fidelity in order to 
effectively engage and instruct. We therefore created two versions 
of the system. Both versions had the high social fidelity of the 
standard BiLAT experience: rich characters, extensive dialogue, 
intricate character backgrounds, and so forth. But only one version 
had the rich visual and auditory experience; the other used a static, 
primarily text-based interface. On one hand, because BiLAT is 
essentially a social-skills trainer, the difference in visual and 
auditory fidelity may not have affected either presence or learning. 
On the other, given the tendency of people to treat virtual human 
interactions as being real (Gratch et al., 2007), we anticipated some 
advantages for the high fidelity version of BiLAT, including a 
deeper sense of presence and realism. 


Method 


Participants. The participants were 46 U.S. citizens (recruited 
from college campuses) who received $60 as compensation for 
approximately 3 hr of participation. 

Measures. We used the SJT in a pretest-posttest design, as 
described previously. We also used the in-game posttest described 
and analyzed, for each participant, the number of actions they took 
and the amount of time they deliberated between actions. Partici- 
pants who took more actions and deliberated for less time were 
thought to be less engaged or to be taking the experience less 
seriously. 

We added a new measure to capture how engaged the partici- 
pants were while playing BiLAT: the Temple Presence Inventory 
(TPI). The TPI is a validated battery of self-report Likert-scale 
ratings that attempt to measure how engaged or immersed one is in 
a multimedia experience (Lombard & Ditton, 1997). We used two 
subscales from the TPI: the Social subscale and the Spatial sub- 
scale. An example of a Social subscale item is “How often (1 = 
never, 7 = always) did it feel as if someone you saw/heard in the 
environment was talking directly to you?” An example of a Spatial 
subscale item is “How much (1 = not at all, 7 = very much) did 
it seem as if you could reach out and touch the objects or people 
you saw/heard?” (Items on the Spatial subscale that addressed 
sound or animation were removed.) 

Design and procedure. After providing consent, the participants 
completed the pretest SJT (online, administered by a survey-hosting 
website). Within a few days, the participants arrived at our 
institute and were randomly assigned to one of the two 
between-subjects conditions. In the three-dimensional (3-D) condition, the 
participants played BiLAT with the rich, immersive interface 
previously described. In the 2-D condition, the participants played 
BiLAT with a nonimmersive, static, text-focused interface (shown 
in Figure 2). The 2-D interface had neither animation nor sound 
but was otherwise equivalent to the 3-D interface. That is, the 
characters and coach functioned identically in both conditions. The 
participants then completed the in-game posttest (using the same 
interface as they used with the market scenario). They then com- 
pleted the two subscales of the TPI described previously. Finally, 
they completed the posttest SJT. 


Results 


Presence and immersion. The primary results from Experiment 1 
are split into Table 1 and Table 2. Table 1 presents the 
participants’ self-reported presence, the number of meetings they 
conducted with each character, and the number of actions they 
took in each meeting. 

As can be seen, there was a main effect of interface on self-reported 
presence. The 3-D interface yielded higher spatial presence 
ratings than did the 2-D interface: t(44) = 3.091, p = .003, 
partial η² = .178. The 3-D interface also yielded higher social 
presence ratings than did the 2-D interface: t(44) = 2.542, p = 
.015, partial η² = .128. 

There was also a main effect of interface on how the participants 
interacted with the virtual characters. (A software error corrupted 
the logs for two participants. Their data did not contribute to this 
analysis.) The participants conducted more meetings in the 2-D 
interface than in the 3-D interface: t(42) = 3.143, p = .003, partial 
η² = .190. During each meeting, the participants performed more 
actions in the 2-D interface than in the 3-D interface: t(42) = 
2.546, p = .015, partial η² = .134. Summed across meetings, 
participants performed nearly 50% more actions in the 2-D inter- 
face than in the 3-D interface in approximately the same amount of 
time. A similar pattern of results appeared in the in-game posttest. 
The 3-D interface, it appears, caused people to think more about 
their actions than did the 2-D interface. 
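
As a reading aid, the effect sizes reported with these t tests can be checked against the test statistics. For an effect with one numerator degree of freedom, t² equals F, and partial eta squared equals t² / (t² + df). The short sketch below applies that standard identity to the four values reported above; the function names are introduced here for illustration.

```python
def partial_eta_squared_from_t(t, df):
    """Partial eta squared for a single-df effect, computed from t."""
    return t * t / (t * t + df)


def partial_eta_squared_from_f(f, df_effect, df_error):
    """General form, computed from F and its degrees of freedom."""
    return f * df_effect / (f * df_effect + df_error)


# (t, df, reported partial eta squared) taken from the text above.
reported = [
    (3.091, 44, 0.178),  # spatial presence
    (2.542, 44, 0.128),  # social presence
    (3.143, 42, 0.190),  # number of meetings
    (2.546, 42, 0.134),  # actions per meeting
]

# Each reported value agrees with the recomputed one to three decimals.
all_consistent = all(
    abs(partial_eta_squared_from_t(t, df) - eta) < 0.001
    for t, df, eta in reported
)
```

The same identity in its F form can be applied to the ANOVA results reported for the overall pretest-to-posttest gains.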

Learning. Table 2 presents the participants’ SJT scores and 
their performance on the in-game posttest. 

Declarative knowledge. A repeated-measures mixed analysis 
of variance (ANOVA) revealed that there was not a main effect of 
interface on the participants’ pretest-posttest gain: F(1, 44) < 1, 
ns. A main effect was not obscured by pretest differences between 
the two conditions; participants assigned to the 2-D interface did 
not score reliably higher than those assigned to the 3-D interface: 
t(44) = 1.330, p = .191. Thus, although the 3-D interface created 
more presence, it did not produce greater learning gains at the lower levels 
of Bloom’s taxonomy (L. W. Anderson & Krathwohl, 2001). 

Application and transfer. There was also not a main effect of 
interface on the probability of successful negotiation during the 
in-game posttest: F(1, 44) = 1.208, p = .278. Finally, there was 
not a main effect of interface on the probability of making an error 
during the in-game posttest: t(40) = 1.536, p = .132. Together with 
the SJT data, these results suggest that greater visual fidelity, and the 
spatial and social immersion it generates, does not have 
a substantial effect on learning cross-cultural interactions as 
addressed in BiLAT. 

Overall gains. We conducted additional analyses of the SJT 
data to examine the overall gains produced by interacting with 
BiLAT and the ITS. A repeated-measures ANOVA revealed 
that there was a main effect of practice on the improvement 
from pretest to posttest. Correlation with subject-matter experts 
increased from pretest (M = 0.56, SE = 0.03) to posttest (M = 
0.72, SE = 0.08): F(1, 44) = 40.039, p < .001, partial η² = 
.476. Overall, it appears that BiLAT and the ITS can effectively 
increase knowledge about how to interact in an intercultural 
context. 


Figure 2. A screenshot of the two-dimensional interface for Bilateral Negotiation Trainer. 

Discussion 


The 3-D version of BiLAT (with animated virtual humans and 
synthesized speech) produced a greater sense of presence than did 
the 2-D interface. However, according to the SJT, there were no 
differences in declarative knowledge gains between the two con- 
ditions. Thus, the 3-D interface did not appear to improve BiLAT’s 
teaching efficiency. 

However, there were several differences in how users in the 
two conditions interacted with the characters. Learners in the 3-D 
environment deliberated longer and, correspondingly, needed 
fewer actions in order to succeed. Research on rapport with virtual 
humans has shown that people react to virtual humans as if they 
are real (Gratch et al., 2007). One possible explanation for the 
interface-driven behavioral differences is that learners were more 
concerned about the impacts of their choices and thus thought 
them through more carefully. They may have been using that time 
to generate better mental simulations of the conversation. They 
may also have been establishing better expectations or generating 
better hypotheses about the mental state of their meeting partner. 
Future studies would be necessary to determine why users delib- 
erate longer with embodied characters and how they are using that 
time. 


Experiment 2: Formative Feedback 


The results of Experiment 1 suggested that the content, not the 
appearance, of the system was responsible for learning. 
We therefore designed Experiment 2 to focus on the effectiveness 
of that content by examining the hints and feedback 
provided by the ITS. Some participants received formative feed- 
back (Shute, 2008), which emphasizes productive revisions of 
knowledge. This feedback was very conceptual in nature (e.g., “Be 
sure to avoid appearing overly defensive or protective”). Other 
participants received very helpful, but very specific, assistance 
(e.g., “You are still in full combat gear, including your helmet and 
sunglasses”). 

Prior studies have found that learners who struggle during 
training eventually prosper as a result (Bjork, 1994; VanLehn, 
1988). We believed that the specific feedback would be more 
helpful during practice and be appealing to learners (since it told 
them exactly what to do), but that the conceptual feedback would 
require the participants to deliberate more and think more deeply 
about the principles of cross-cultural interaction. We therefore 
predicted a greater increase in declarative knowledge (greater SJT 
improvement) and better transfer to new contexts (greater in-game 
posttest performance) for participants in the formative feedback 
condition. 


Table 1 
Temple Presence Inventory and Meeting-Interaction Data From 
Experiment 1 by Condition 

                     TPI                  No. of interactions 
              Social      Spatial       Meetings      Actions 
Interface   Mean   SE   Mean   SE     Mean   SE    Mean   SE 
2-D         [values illegible in source] 
3-D         [values illegible in source] 

Note. Participants’ self-reported presence, the number of meetings they 
conducted with each character, and the number of actions they took in each 
meeting. TPI = Temple Presence Inventory; Social = Social subscale of 
the TPI; Spatial = Spatial subscale of the TPI; SE = standard error; 2-D = 
two-dimensional interface; 3-D = three-dimensional interface. 


Table 2 
Situational Judgment Test and In-Game Posttest Data From 
Experiment 1 by Condition 

                     SJT                    In-game posttest 
              Pretest     Posttest     p (success)    p (error) 
Interface   Mean   SE   Mean   SE     Mean   SE    Mean   SE 
2-D         .59    .04  [?]    .03    [?]    [?]   [?]    .02 
3-D         [?]    .04  [?]    .03    [?]    .10   .27    .02 

Note. Participants’ Situational Judgment Test (SJT) scores and their 
performance on the in-game posttest. 2-D = two-dimensional interface; 
3-D = three-dimensional interface; SE = standard error. Cells marked [?] 
are illegible in the source. 


Method 


Participants. The participants were 47 U.S. citizens (recruited 
from college campuses) who received $60 as compensation for 
approximately 3 hr of participation. 

Design and procedure. After providing consent, the partici- 
pants completed the pretest SJT online (cf. Experiment 1). Within 
a few days, the participants arrived at our institute and were 
randomly assigned to one of the two between-subjects conditions. 
Some of the participants used BiLAT with a coach that provided 
hints and feedback that were exclusively specific to in-game ac- 
tions. The coach for the other participants provided hints and 
feedback that were exclusively conceptual. The two versions of the 
coach¹ otherwise behaved identically in the two conditions (e.g., 
they chose when to provide feedback or hints based on the same 
policies). 

The participants then completed the in-game posttest. They were 
then compensated and dismissed. A week later, the participants 
were e-mailed a link to the posttest SJT; 46 of the 47 participants 
completed it after an average of about 2 days. 


Results 


The primary results from Experiment 2 are presented in Table 3. 

Declarative knowledge. There was not a main effect of feedback 
type on the participants’ pretest-posttest gain: F < 1, ns. 
Their acquisition of declarative knowledge appears to not have 
been influenced by specific versus conceptual feedback. 

Application and transfer. On the in-game posttest, there was 
not a main effect of coach type on the probability of a successful 
meeting outcome: t < 1, ns. However, there was a main effect of 
feedback type on the probability of making an error: (40) = 2.049, 
p = .05, partial n* = .095. Even with equivalent declarative 
knowledge, the participants who encountered the conceptual coach 
were better able to interact with the new character in order to solve 


¹ Typically, the coach follows a simple policy to decide whether to 
provide specific or conceptual feedback during practice (which is to try 
general first and then shift to concrete if the learner struggles). 


Table 3 
Effects of Feedback Type on Situational Judgment Test Scores 
and In-Game Posttest Performance 

                           SJT                     In-game posttest 
                     Pretest     Posttest     p(success)    p(error) 
Coach type           Mean  SE    Mean  SE     Mean  SE      Mean  SE 
Specific feedback    3S    Of    45    2      a     CS      2     Ly 
Conceptual feedback  .56   .04   .74   .02    .76   .10     .15   .02 

Note. SJT = Situational Judgment Test; SE = standard error. 


the problem. This pattern of results is consistent with the notion 
that formative feedback can be superior to simple performance- 
oriented feedback (Shute, 2008). Further, the disparity between the 
in-game posttest results and the SJT is consistent with our belief 
that these two measures operated at different levels of Bloom’s 
taxonomy of learning (L. W. Anderson & Krathwohl, 2001; Hays, Ogan, 
& Lane, 2010). 

Overall gains. On the SJT, there was a main effect of practice 
with BiLAT. A repeated-measures ANOVA revealed that the 
participants’ SJT scores increased from pretest to posttest: F(1, 
44) = 61.169, p < .001, partial η² = .582. As in Experiment 1, it 
appears that the participants learned from their in-game experi- 
ence. 
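
The effect sizes reported above can be checked against their test statistics. The following is a sketch that assumes the standard relation between an F ratio and partial eta squared, and that treats the error-probability statistic of 2.049 on 40 degrees of freedom as a t value (so that F = t²):

```latex
\[
\eta_p^2 = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}}
\]
\[
\text{SJT gain: } \eta_p^2 = \frac{61.169}{61.169 + 44} \approx .582
\qquad
\text{Error probability: } \eta_p^2 = \frac{2.049^2}{2.049^2 + 40} \approx .095
\]
```

Both computed values agree with the effect sizes reported for Experiment 2.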


Discussion 


In Experiment 2, we tested the hypothesis that general feedback 
would be better for learning. The results suggested that conceptual 
feedback transfers more readily than does concrete feedback. Al- 
though we cannot conclude that concrete feedback never has a 
place (extreme versions of the ITS were used in the experiment), 
this study does suggest that for intercultural communication skills, 
a reasonable default setting is to use conceptual feedback first and 
then shift to concrete if future performance gains are not observed. 
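
The default suggested here (conceptual first, concrete only if the learner keeps struggling) can be sketched in a few lines. This is an illustration only and not the BiLAT coach's actual implementation: the `CoachState` record, the `choose_feedback` function, and the `max_errors_after_hint` threshold are hypothetical names and values introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class CoachState:
    """Hypothetical per-skill record kept by a coach (not from BiLAT)."""
    conceptual_hints_given: int = 0   # conceptual hints already shown for this skill
    errors_since_hint: int = 0        # learner errors since the last conceptual hint


def choose_feedback(state: CoachState, max_errors_after_hint: int = 1) -> str:
    """Return 'conceptual' by default and 'concrete' only if the learner
    continues to make errors after conceptual help has been given.

    The threshold is an assumed tuning parameter, not a published value.
    """
    if state.conceptual_hints_given == 0:
        # Always start with the more general, conceptual message.
        return "conceptual"
    if state.errors_since_hint > max_errors_after_hint:
        # No improvement observed after conceptual help: become specific.
        return "concrete"
    return "conceptual"
```

A study that varied `max_errors_after_hint` would be one way to explore the balance between the two extremes tested in Experiment 2.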


General Discussion 


We sought to build a virtual environment for teaching intercul- 
tural communication skills with virtual humans in a specific con- 
text (i.e., Arab business practice). Our approach included the use 
of modern game technologies (i.e., a 3-D game engine) and an 
intelligent tutor to scaffold learners as they interacted with those 
characters. We conducted two experiments in which we found 
large overall gains in declarative knowledge as a result of inter- 
acting with this system. We found that the visual fidelity of the 
interface had significant impacts on learner behaviors, perceptions, 
and in-game success. We also found that conceptual feedback 
enhanced learners’ ability to apply the targeted knowledge. At 
least in the context of intercultural communication, the nature of 
feedback had a much greater influence on learning than did visual 
fidelity. 


Fidelity 


For social simulations and virtual humans, the choice of where 
to invest development time is challenging: It is difficult to ignore 
any single dimension but also difficult to develop elaborate models 
for all relevant aspects of the communicative skill. Our studies 
suggest that for the message reception/production model of com- 
munication with a menu-based simulation of social interactions, 
learning of declarative knowledge is not affected by the richness of 
the sound and animations. However, given the complexities of 
social interactions, there are analogs to other domains that require 
higher fidelity simulations, like flight training. For example, vir- 
tual human agents used for teaching recognition of nonverbal 
behaviors (e.g., in deception detection training) would require a 
higher level of visual fidelity to properly capture and teach the 
subtle elements that are part of the knowledge being covered. 

In BiLAT, learners are practicing the decision making involved 
in intercultural communication and learning what differences re- 
quire attention. Variations in speech and nonverbal behaviors are 
not as critical, given these goals, and so the fidelity important to 
BiLAT learning has most to do with the content of the characters’ 
utterances, which is driven by the underlying models. The result 
that a richer interface engendered longer deliberations suggests 
that future studies are needed in order to understand the nature of 
how this time is being used: if virtual humans and high-fidelity 
graphics can be linked to greater attention to consequences of 
actions or more self-explanations, then future studies should seek 
to determine if these do in fact contribute to learning. 


Feedback 


Presence is often defined as forgetting that one is having a 
mediated experience. Thus, it is important to understand (a) if the 
use of unsolicited feedback interrupts this experience and (b) if 
that positively or negatively impacts learning. We found no evi- 
dence that the use of feedback (from a disembodied ITS) impacted 
the learner’s sense of presence in either environment. How to 
deliver feedback optimally is an ongoing question for learning 
science researchers. Our study found benefits for using more 
general feedback that, we posit, required the learner to interpret the 
help and apply it to his or her own situation (i.e., it is formative). 
As our study only tested the extremes, an ITS that properly 
balances concrete feedback with conceptual feedback is more 
likely to be effective for most learners. Future studies should 
focus on various algorithms for comparing different timing and 
content settings. 


Limitations and Future Work 


There were several technical and methodological limitations of 
the present studies. 

Modality. Since communicating with BiLAT characters is 
accomplished through menu-based selection of actions rather than 
free speech input, learners are limited in what they can say and are 
most likely influenced by the choices that are available. This is 
vastly different than being required to generate utterances as they 
would in normal conversation. On the other hand, this design 
choice reflects current limitations of speech input and natural 
language understanding and does provide some structure for nov- 
ice learners (J. M. Kim et al., 2009). Thus, because our measures 
focus on culture at the same level of abstraction as the game, it is 
unclear whether BiLAT practice with coaching would transfer to 
more realistic contexts. Because this is a critically important question, 
it suggests further study with a more elaborate (and expensive) 
posttest measure that uses human role players or perhaps 
virtual humans that are capable of understanding free speech input. 

Feedback. As discussed, feedback in BiLAT is delivered as 
text. An important lesson learned about the delivery of feedback 
resulted from early testing when we discovered that many learners 
were not reading the coaching messages. The reason, we found 
out, was that the BiLAT display included a readout of how much 
“trust” the character felt toward the learner. Thus, people played 
by selecting an action, listening to the character’s response, and 
watching for changes in the trust meter. Because the messages 
were rendered only as text in the dialogue pane (on the other side 
of the screen from the trust meter), they often went unseen. We 
resolved this issue by hiding the trust meter in all of our studies 
and drawing attention to the coach and how it worked when 
introducing the participants to the system. Having a 
trust meter or other visible “score” may have benefits for learning 
(e.g., rapid self-assessments), and thus our studies are limited in 
that they do not address the interaction of learning and game-like 
elements, like score, narrative, or challenge. 

Further, our studies focus only on limited forms of feedback 
delivery. In Experiment 2, we considered two extreme versions of 
feedback: either completely general or completely concrete. It is 
likely that some balance between these is needed, and thus, continued 
study of how the generality or specificity of feedback 
should vary with context is needed. Also, the settings on our 
model-scaffold-fade algorithm were based on trial-and-error and 
observation of learners interacting with the system. Our goal was 
to strike a balance and provide the best level of support given the 
successes (or failures) of the learner, yet we had no theoretical 
support for the settings we used on this algorithm. Although there 
is substantial literature on the form of feedback (Shute, 2008), in 
general, we have found little guidance—empirical or theoretical— 
regarding the timing and optimal rates of fading of tutor support. 
This suggests future studies varying our algorithm and investigat- 
ing the impact of these on performance and learning. 

Measures. Unfortunately, both of the measures we used to 
gauge changes in learners have drawbacks. Although the in-game 
posttest does detect changes in learners’ understanding of some 
culturally based rules of interaction, it is conducted within the 
environment used to teach those rules. As a result, it may only 
reflect shallow learning (i.e., learning to play the game rather than 
learning about the culture) or basic evidence of the existence of 
smooth learning curves. Because the ultimate test of learning is in 
face-to-face interactions with people from the target culture, in- 
game performance measures are inherently suggestive, at best. 

Also, as discussed, although the SJT was designed by an exter- 
nal team, it may not be sufficiently precise to detect learning that 
occurs during BiLAT meetings. Further, it includes content from 
components of BiLAT that are not part of the tutoring system, such 
as preparation (i.e., research on counterparts) and broader scenario 
issues, like following up on commitments and social network 
changes based on the overarching narrative in the scenarios. It 
should also be noted that both these measures focus exclusively on 
the message production side of social interaction. Thus, even 
though the ITS does address message reception skills, our studies 
had no chance to detect any changes in a learner’s ability to 
process and understand the utterances from the virtual human 
characters. 

Another potential limitation of the SJT is that it uses identical 
prompts on the pretest and posttest. One could therefore argue that 
the overall gains from pretest to posttest we have reported merely 
reflect learning from the test. We took care to reduce this possi- 
bility by modifying some of the prompts to avoid divulging additional 
information (Asher, 2007). Also, because the SJT responses 
are numeric rather than potentially informative solutions (as in 
multiple-choice tests), it is unlikely that the participants used the 
SJT to guide their BiLAT experience so that their posttest score 
would be improved. Nevertheless, multiple counterbalanced ver- 
sions of the SJT would be a more empirically sound measure. 

Many other measures were possible, such as perspective-taking 
instruments (Paige, 2006) and measures to gauge perceived rela- 
tionships with the agent (Ogan et al., 2011). Such measures are 
extremely important in intercultural development because much of 
it involves adjustment of one’s own perspective on self, others, and 
more (Bennett, 1993), and so in future studies, investigators should 
more carefully address the role of feedback and fidelity on these 
factors while respecting the practical limits on testing time used 
during controlled studies. 


Conclusion 


This article began with the question of how virtual human role 
players might be used to enhance the learning of communication 
skills and highlighted the dearth of guidelines, principles, and 
empirical evidence for their design. Broadly, the results of our 
studies support the limited, but growing, body of literature (Dur- 
lach et al., 2008; Surface et al., 2007) that virtual humans can be 
used effectively to improve intercultural communication knowl- 
edge and skills. Generally, learners in both of our experiments 
showed gains in declarative knowledge from pretest to posttest. 
The key takeaway messages from these studies are that (a) the 
fidelity of such systems should be a function of the domain 
knowledge being taught and (b) feedback can be given in such a 
way that it enhances future performance and does not distract from 
the immersive nature of the system. Although our studies were not 
specifically “design” studies, further investigation of precise ma- 
nipulations of virtual human content, behavior, and interaction 
modalities is definitely necessary. As with many advanced tech- 
nologies (games, mobile devices, and so on), the number of avail- 
able systems from the commercial and research sectors is rapidly 
growing, and so there is an urgent need for empirically derived 
guidelines and principles for using and scaffolding learning with 
virtual humans. 

Many open questions remain about the use of virtual humans in 
social skills training and education. We believe future work is 
needed to develop and test new measures of learning and perspec- 
tive change and to understand the role of feedback in these envi- 
ronments. As virtual humans continue to approach live human role 
players in realism, continued experimental research that focuses on 
the nature of these interactions, the sophistication of their imple- 
mentations, and the role of supporting technologies such as intelligent 
tutoring is certainly merited. 


Motivation and Performance in a Game-Based Intelligent Tutoring System 


G. Tanner Jackson and Danielle S. McNamara 
Arizona State University 


One strength of educational games stems from their potential to increase students’ motivation and 
engagement during educational tasks. However, game features may also detract from principal learning 
goals and interfere with students’ ability to master the target material. To assess the potential impact of 
game-based learning environments, in this study we examined motivation and learning for 84 high-school 
students across eight 1-hr sessions comparing 2 versions of a reading strategy tutoring system, an 
intelligent tutoring system (iSTART) and its game-based version (iSTART-ME). The results demonstrate 
equivalent target task performance (i.e., learning) across environments at pretest, posttest, and 
retention, but significantly higher levels of enjoyment and motivation for the game-based system. 
Analyses of performance across sessions reveal an initial decrease in performance followed by improve- 
ment within the game-based training condition. These results suggest possible constraints and benefits of 
game-based training, including time-scale effects. The findings from this study offer a potential expla- 
nation for some of the mixed findings within the literature and support the integration of game-based 
features within intelligent tutoring environments that require long-term interactions for students to 
develop skill mastery. 


Keywords: educational games, intelligent tutoring, motivation and performance 


Intelligent tutoring systems (ITSs) are automated tutoring envi- 
ronments that adapt to users based on various well-established 
cognitive principles and algorithms (Anderson, 1982). This ap- 
proach has been highly successful for the last several decades as 
evidenced by significant learning gains in studies covering a wide 
range of domains (e.g., Cohen, Kulik, & Kulik, 1982; Graesser, 
McNamara, & VanLehn, 2005; Merrill, Reiser, Ranney, & Traf- 
ton, 1992). However, one potential weakness of long-term ITSs is 
that while novel to students at first, they can become repetitive 
over time. This facet is a particular problem when the targeted skill 
or knowledge requires extended practice to reach sufficient mas- 
tery or depth of understanding. 

An increasing number of long-term tutoring systems focus on 
prolonged skill acquisition across multiple interactions, and sev- 
eral of these have been integrated and evaluated within ecological 
settings (Jackson, Boonthum, & McNamara, 2010; Johnson & 
Valente, 2008; Koedinger & Corbett, 2006; Meyer & Wijekumar, 
2011). Due to the extended time span of these interactions, stu- 
dents can sometimes become disengaged and bored while using 
some systems (e.g., Arroyo et al., 2007; Baker, D’Mello, Rodrigo, 
& Graesser, 2010; Bell & McNamara, 2007). When the learning 
process is expected to require multiple days, weeks, or months, 
designing the environment such that it induces the learner to 
persist should be paramount among the design objectives. If stu- 
dents do not remain engaged and persist within a training envi- 
ronment, attaining a long-term learning objective is nearly impos- 
sible. Furthermore, for those students who continue to interact 
despite lack of interest, boredom may trigger a vicious cycle that 
prevents them from actively re-engaging in constructive learning 
processes (Baker, Corbett, & Koedinger, 2004; D’Mello, Taylor, 
& Graesser, 2007). 

Educational systems that require a longer training commitment 
may benefit from design features that enhance student engagement 
after any novelty effects have dissipated. Nonetheless, there is 
more to learning than interest and engagement. Sacrificing essen- 
tial pedagogical aspects of an educational environment to increase 
interest is not likely to be successful. As these constraints have 
become more evident, system designers have begun to carefully 
incorporate educational games and game-based elements to help 
capture students’ interest and promote active participation within 
learning environments (McNamara, Jackson, & Graesser, 2010). 


Game-Based Learning 


It is intuitively clear that games are a potentially strong moti- 
vating factor for students (Gee, 2003; Steinkuehler, 2006). A 
natural, intrinsic interest in the domain content of the system is, of 
course, the preferred method of obtaining involvement, but unfor- 
tunately not all learners share interests. While the content itself 
plays an important role for determining interest, perhaps the fram- 
ing of this content (e.g., incorporating it within a game) is even 
more crucial. Thus, a game itself can be used as a catalyst to 
promote motivation and sustain the interest of students. 

The increased focus on games in education may also be partially 
due to the alignment between aspects of game design and the goals 
of educational environments. This is not simply a grafting of two 
successful but incompatible technologies; research suggests that 
these technologies have a common theoretical foundation and that 
the sum is greater than the parts (e.g., Laird & van Lent, 2000; Van 
Eck, 2006). Specifically, an essential overarching benefit to games 
is that they, similar to tutoring systems, provide the opportunity for 
adaptive, individualized interactions. The notion behind these 
highly interactive educational games is to involve learners, giving 
them opportunities to perform, experience outcomes, and reflect 
on the targeted tasks such that these actions are integrated within 
a meaningful context (Barab et al., 2010). 

Games often improve engagement and motivation by employing 
features similar to those found within successful tutoring systems. 
For example, one of the many motivating factors of games is the 
individual and personalized nature of the interactions that adapt to 
the skills and actions of the player (Gee, 2005; Malone & Lepper, 
1987; Rieber, 1996). To accomplish this goal, an educational game 
must be able to identify the ability level of the learner and adjust 
itself accordingly (Conati, 2002; Rieber, 1996; Shute & Towle, 
2003). As such, the game may require demonstration of more 
advanced skills or knowledge from a learner progressing success- 
fully through the game or lessen the requirements for a learner 
progressing poorly. Additionally, the rapid feedback within edu- 
cational games can help learners to better regulate their progress 
and activities. Indeed, the role of feedback in any learning envi- 
ronment can lend a stronghold on engagement (Anderson, Corbett, 
Koedinger, & Pelletier, 1995; Corbett & Anderson, 1990; Foltz, 
Gilliam, & Kendall, 2000). By leveraging these features to in- 
crease engagement and motivation, these games are highly com- 
patible with the sophisticated pedagogy implemented within most 
ITSs. 

Another aspect of games that maps onto pedagogical goals is the 
notion of challenge (i.e., task difficulty; Gredler, 2004; Rieber, 
1996). Games that are easily won require little effort from learners. 
On the other hand, games that are too difficult can result in 
lowered interest because learners are unable to accomplish goals. 
Vygotsky (1978) posited that learning is most effective when the 
material is slightly more advanced than the learner. With respect to 
game challenge, the same hypothesis could apply. A game that is 
slightly more challenging than the learner’s skill and knowledge 
may sustain interest and motivation by providing accomplishment 
while maintaining effort (Gee, 2003). Indeed, self-efficacy and 
interest in games have been found to be highly correlated (Zim- 
merman & Kitsantas, 1997). Ratings of higher self-efficacy during 
game play coincide with higher preferences for one game over 
others. Thus, accomplishment by the players over consistent chal- 
lenges should raise their self-efficacy, overall enjoyment, and 
motivation to perform the task. 


Motivation and Mastery in Educational Games 


Ample research shows that learning (and mastery) is more than 
just a cognitive process (du Boulay, 2011); learning is as much a 
motivational and affective task as it is a demonstration of mental 
ability. Research also suggests that there is an indirect link be- 
tween motivation and learning (Garris, Ahlers, & Driskell, 2002); 
namely, motivation influences the learning processes in which 
students engage, and these processes subsequently affect learning 
outcomes. 

Motivation is a multidimensional construct that subsumes a 
number of component factors, such as interest, enjoyment, expec- 
tancies, and values. For the current work, motivation generally 
refers to students’ desire to perform a task and willingness to 
expend effort on that activity (Garris et al., 2002; Pintrich & 
Schrauben, 1992; Wolters, 1998). This broad conceptualization of 
motivation encompasses previous research examining both intrin- 
sic and extrinsic factors related to interest, engagement, enjoy- 
ment, and self-efficacy. This prior work has indicated that enhanc- 
ing these aspects of motivation positively impacts learning 
(Alexander, Murphy, Woods, Duhon, & Parker, 1997; Bandura, 
2000; Pajares, 1996; Pintrich, 2000; Young et al., 2012; Zimmer- 
man & Schunk, 2001). Other research has demonstrated that 
various mechanisms common to games, such as feedback, incen- 
tives, task difficulty, and control, can have a significant impact on 
these motivational constructs and, hence, may ultimately affect 
learning (Conati, 2002; Corbett & Anderson, 2001; Cordova & 
Lepper, 1996; Graesser, Chipman, Leeming, & Biedenbach, 2009; 
Malone & Lepper, 1987; Moreno & Mayer, 2005; Shute, 2008). 

Many games leverage these mechanisms and other features as 
part of a core game design. No individual feature is required within 
a game, and some game elements may even be unnecessary, 
ineffective, or distracting in the short term, but they may also have 
the potential to increase interest, enjoyment, and engagement in 
the long term. Previous research has also suggested that the affec- 
tive benefits from games may increase as the number of incorpo- 
rated game-based features increases (Cordova & Lepper, 1996; 
Papastergiou, 2009). Therefore, some researchers have assumed 
that combining several game features together will provide stu- 
dents with a more enjoyable interaction (Asgari & Kaufman, 2004; 
McNamara et al., 2010). 

Unfortunately, despite the increase in research related to edu- 
cational games, there remains a dearth of research in which the 
effectiveness of these new gaming environments has been directly 
compared with their natural counterpart, traditional intelligent 
tutoring environments (O’Neil & Fisher, 2004; O’Neil, 
Wainess, & Baker, 2005). Two recent studies have been conducted 
in which researchers have directly investigated the effectiveness 
and benefits of educational game components compared with an 
ITS. The first study by Jackson, Dempsey, and McNamara (2012) 
was a 90-min experiment to compare the short-term practice 
effects of a traditional ITS environment with a game-based coun- 
terpart. They found that participants who had interacted with 
game-based practice rated it as significantly more engaging than 
students within the traditional ITS. By contrast, students who 
interacted with the traditional ITS outperformed students who 
practiced using the game environment. 

A second smaller study was conducted over a longer time span 
(six separate sessions) to investigate a combined system that 
allowed users to continually choose between practicing with an 
ITS or a game-based system (Jackson, Dempsey, Graesser, & 
McNamara, 2011). Participants in this study completed a 2-hr 
introductory training session before entering the practice environ- 
ment where they could choose between systems (for the remaining 
4–5 hr across sessions). Focusing on the results comparing the 
same two systems from Jackson et al. (2012), there were no 
advantages for the traditional ITS in this longer term study in terms 
of performance (comparing within-subjects). The students per- 
formed equally well within both systems. In addition, although 
there were trends showing improved enjoyment for the game- 
based system over the ITS, the difference was not statistically 
significant. The findings from these studies helped to motivate the 
current study, provide support for differing hypotheses (discussed 
in more detail later), and suggest that the current study is needed 
to further explore the complex interplay between games and learn- 
ing (also see Harris, 2008). 

The current work aims to more directly address these issues in 
game-based learning by comparing the outcomes from two similar 
long-term skill acquisition systems: a traditional ITS (iSTART) 
and an educational game (iSTART-ME). 


iSTART and iSTART-ME 


The Interactive Strategy Training for Active Reading and 
Thinking-Motivationally Enhanced (iSTART-ME) tutor is a 
newly developed game-based learning environment built on top of 
an existing tutoring system (iSTART). iSTART provides young 
adolescents to college-age students with comprehension strategy 
training to better understand and learn from challenging science 
texts (McNamara, Levinstein, & Boonthum, 2004; McNamara, 
O’Reilly, Best, & Ozuru, 2006). In iSTART, pedagogical agents 
instruct trainees in the use of self-explanation and other active 
reading comprehension strategies to explain the meaning of sci- 
ence text while they read. The training was motivated by empirical 
findings showing that students who self-explain text are more 
successful at solving problems, more likely to generate inferences, 
construct more coherent mental models, and develop a deeper 
understanding of the concepts covered in text (Chi, Bassok, Lewis, 
Reimann, & Glaser, 1989; Chi, de Leeuw, Chiu, & LaVancher, 
1994; McNamara, 2004). 


iSTART Modules 


Strategy instruction occurs in three stages with each stage requiring 
increased interaction on the part of the learner. During the Introduc- 
tion Module of iSTART, a trio of animated characters introduces 
students to the concept of self-explanation and associated reading 
strategies by providing information, posing questions, and discussing 
examples. In the second phase, called the Demonstration Module, two 
agents demonstrate the use of self-explanation using a science text, 
and the trainee identifies the strategies being used by the agents. 
During this module, the teacher character (Merlin) asks the trainee to 
indicate which strategies the student agent (Genie) employed in pro- 
ducing his self-explanation. Finally, Merlin gives Genie feedback on 
the quality of his self-explanation. 

In the third phase (Practice), Merlin coaches and provides feedback 
while the trainee practices self-explanation using the repertoire of 
reading strategies. The goal is to help the trainee acquire the skills 
necessary to integrate prior text and prior knowledge with the current 
sentence content. For each sentence, Merlin reads the sentence, asks 
the trainee to explain it by typing a self-explanation, and provides 
feedback on the quality of the explanation. 

The iSTART assessment algorithm drives the feedback pro- 
vided by Merlin. The algorithm output is coded as a 0, 1, 2, or 3. 
An assessment of 0 indicates that the self-explanation was either 
too short or contained mostly irrelevant information. An iSTART 
score of 1 is associated with a self-explanation that primarily 
relates only to the target sentence itself (sentence-based). A 2 
means that the student’s self-explanation incorporated some aspect 
of the text beyond the target sentence (text-based). If a self- 
explanation earns a 3, then it is interpreted to have incorporated 
information at a global level and may include outside information 
or refer to an overall theme across the whole text (global-based). 
This algorithm has demonstrated performance comparable to that 
of humans and provides a general indication of the cognitive 
processing required to generate a self-explanation (Jackson, Guess, 
& McNamara, 2010). 

Figure 2. Screenshot of iSTART-ME selection menu. 
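
The four output codes of the assessment algorithm can be summarized as a small lookup. This sketch only encodes the interpretation of each code as described above; it does not reproduce the assessment algorithm itself, and the names `SelfExplanationScore`, `DESCRIPTIONS`, and `describe` are hypothetical.

```python
from enum import IntEnum


class SelfExplanationScore(IntEnum):
    """Output codes of the iSTART assessment, as described in the text."""
    TOO_SHORT_OR_IRRELEVANT = 0
    SENTENCE_BASED = 1
    TEXT_BASED = 2
    GLOBAL_BASED = 3


# Plain-language reading of each code, paraphrased from the description above.
DESCRIPTIONS = {
    SelfExplanationScore.TOO_SHORT_OR_IRRELEVANT:
        "too short or mostly irrelevant information",
    SelfExplanationScore.SENTENCE_BASED:
        "sentence-based: relates primarily to the target sentence itself",
    SelfExplanationScore.TEXT_BASED:
        "text-based: incorporates text beyond the target sentence",
    SelfExplanationScore.GLOBAL_BASED:
        "global-based: incorporates outside information or an overall theme",
}


def describe(score: int) -> str:
    """Map a numeric code to its interpretation; raises ValueError if not 0-3."""
    return DESCRIPTIONS[SelfExplanationScore(score)]
```

Because the codes are ordered, higher values can be compared directly (for example, `SelfExplanationScore.GLOBAL_BASED > SelfExplanationScore.TEXT_BASED`).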

Within iSTART, there are two types of practice modules. The 
first practice module is situated within the core context of iSTART 
(initial 2-hr training) and includes two texts. The second practice 
module is a form of extended practice, which operates in the same 
manner as the regular practice module. This extended practice 
phase (called Coached Practice; see Figure 1 for a screenshot) is 
designed to provide a long-term learning environment that can 
span weeks or months. Research has shown that this extended 
practice increases students’ performance over time (Jackson, 
Boonthum, et al., 2010). However, one unfortunate side effect of 
this long-term interaction is that students often become disengaged 
and uninterested in using the system (Bell & McNamara, 2007). 


iSTART-ME 


Previous research with iSTART pointed to the need for students 
to persist within the system across several days of training. There- 
fore, changes were implemented within the system to combat the 
problem of disengagement over time. The extended practice mod- 
ule of iSTART was redesigned and situated within a game-based 
environment called iSTART-ME (Motivationally Enhanced). This 
game-based environment was built directly on top of the existing 
iSTART system. The main goal of the iSTART-ME project was to 
implement several of the game-based principles and mechanisms 
that were expected to support effective learning, increase motiva- 
tion, and sustain engagement throughout a long-term interaction 
with an established ITS. The project attempted to implement and 
potentially manipulate these motivational constructs via game- 
based features that map onto one of five interaction mechanisms: 
feedback, incentives, task difficulty, control, and environment (see 
McNamara et al., 2010, for more details on the mechanisms). 

The original ITS version of iSTART with Coached Practice 
automatically progresses students from one text to another with no 
intervening tasks. The new version of iSTART-ME is situated 
within a cohesive meta-game and point-based economy that the 
user can control through a selection menu (see Figure 2 for 
screenshot). This new selection menu provides students with op- 
portunities to interact with new texts (control/task difficulty), earn 
points and trophies (feedback/incentives), advance through levels 
(feedback/incentives), unlock new features (control/incentives/en- 
vironment), purchase rewards (control/incentives/environment), 
personalize a character (control/incentives/environment), and play 
educational mini-games (control/incentives/task difficulty). 

Within iSTART-ME, students earn points as they interact with
texts and provide their own self-explanations. Each time a student 
submits a self-explanation, it is assessed by the iSTART algorithm 
and points are awarded based on a scoring rubric. The rubric has 
been designed to reward consistently good performance. So stu- 
dents earn more points if they repeatedly provide good self- 
explanations but earn fewer points if they fluctuate between good 
and poor performance. These points help go beyond the qualitative 
responses from the animated agents to provide an additional, 
quantifiable form of feedback as students learn and practice the 
self-explanation strategies. For example, students can easily un- 
derstand that a score of 30 is better than a score of 10, but it is more 
difficult to gauge the relative difference between, “All right, let’s 
keep going” and “You're starting to get the hang of this.” In 
addition to serving as a form of feedback, earning points within 
iSTART-ME serves two main incentive purposes: advancing
through levels and purchasing rewards. 
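
The idea of a rubric that rewards consistency can be illustrated with a minimal sketch. The article does not give the actual iSTART-ME rubric, so everything here is an assumption made for illustration: the function name `award_points`, the base value, the streak bonus, and the threshold for a "good" self-explanation are all hypothetical.

```python
def award_points(scores, good_threshold=2, base=10, streak_bonus=5):
    """Return the points earned for each self-explanation.

    scores: quality scores on a 0-3 scale, in the order submitted.
    All numeric parameters are hypothetical, not the iSTART-ME values.
    """
    points = []
    streak = 0  # number of consecutive "good" self-explanations so far
    for score in scores:
        if score >= good_threshold:
            streak += 1
            # Each additional good response in a row earns a larger bonus.
            points.append(base * score + streak_bonus * (streak - 1))
        else:
            streak = 0  # a poor response resets the streak
            points.append(base * score)
    return points


# Same set of scores, different ordering.
consistent = award_points([3, 3, 0, 0])
fluctuating = award_points([3, 0, 3, 0])
```

Under these assumed values, the student who produces the two good self-explanations back to back earns more in total than the student who alternates between good and poor ones, which is the qualitative behavior the rubric is described as having.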

As students accumulate more points, they advance through a 
series of levels. Each subsequent level requires an increasing 
number of points. Therefore, students must expend slightly more 
time or effort for further advancement (i.e., increasing task diffi- 
culty to reach a new level). Whenever students advance up a level, 
a new subset of features is automatically unlocked and becomes 
available within the interface (thus acting as an incentive and 
providing additional control). Each of the iSTART levels is
labeled (e.g., ultimate bookworm, serious strategizer) to help pro- 
vide incentive, increase interest, and serve as global indicators of 
progress across texts. 
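
A level scheme in which each level costs more points than the one before can be sketched as follows. The specific costs are not reported in the article; `first_cost` and `growth` are hypothetical parameters chosen only to show the mechanism.

```python
from bisect import bisect_right


def level_thresholds(n_levels, first_cost=100, growth=50):
    """Cumulative points needed to reach levels 1..n_levels.

    Each level costs `growth` more points than the previous one
    (hypothetical values, not taken from iSTART-ME).
    """
    thresholds = []
    total = 0
    for k in range(n_levels):
        total += first_cost + growth * k
        thresholds.append(total)
    return thresholds


def current_level(points, thresholds):
    """Number of levels whose cumulative threshold has been reached."""
    return bisect_right(thresholds, points)


thresholds = level_thresholds(4)
level = current_level(449, thresholds)
```

Because the gap between successive thresholds grows, a student earning points at a constant rate advances more slowly at higher levels, matching the description of slightly more time or effort per level.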

Points can also be used to control the environment by “purchas- 
ing” incentives within the system. One of the options available 
as a reward allows students to change aspects of the learning 
environment. They can spend some of their points to choose a
new tutor agent, change the interface to a new color scheme, or
update the appearance of their personalizable avatar. These
features provide students with a substantial amount of control
and personalization over their environment and have been designed
as purchasable replacements, rather than continuously
available options, to help reduce off-task behaviors (such as
switching back and forth between agents solely to see what they
all look like).

Figure 3. Screenshot of Showdown in iSTART-ME.

Last, a suite of eight educational mini-games has been designed
and incorporated within the iSTART-ME extended practice
module. Some mini-games require identification of the type of 
strategy use, while others may require students to generate their 
own self-explanations. The majority of iSTART-ME mini-games
require similar cognitive processes enveloped within different 
combinations of gaming elements. 

Showdown and Map Conquest are two methods of generative 
game-based practice that use the same iSTART assessment 
algorithm from regular practice. In Showdown (see Figure 3 for 
a screenshot), students compete against a computer player to 
win rounds by producing better self-explanations. After the 
learner submits a self-explanation, it is scored, the quality 
assessment is represented as a number of stars (0-3), and an 
opponent self-explanation is also presented and scored. The 
difficulty of the opponent self-explanation has been manipu- 
lated within previous experiments (Dempsey, 2011); however, 
for normal gameplay, the opponent example is chosen at random
to provide a range of student modeling (i.e., good and bad
examples). The self-explanation scores are compared, and the 
player with the most stars wins the round. The player with the 
most rounds at the end of the text is declared the winner. The 
combination of features for Showdown incorporates aspects of 
feedback (points, stars, rounds won), incentives (points, stars), 


control (production of self-explanation), and task difficulty 
(opponent, text content). 
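
The round-by-round logic of Showdown described above can be sketched in a few lines. This is an illustration only: the function name `play_showdown` is hypothetical, the star ratings are passed in directly rather than produced by the iSTART assessment algorithm, and the handling of tied rounds (no one scores) is an assumption, since the article does not say how ties are resolved.

```python
def play_showdown(student_stars, opponent_stars):
    """Tally rounds won from per-round star ratings (0-3).

    Returns (student_rounds, opponent_rounds, winner).
    Tied rounds award no one a round (an assumption).
    """
    if len(student_stars) != len(opponent_stars):
        raise ValueError("each round needs a rating for both players")

    student_rounds = sum(s > o for s, o in zip(student_stars, opponent_stars))
    opponent_rounds = sum(o > s for s, o in zip(student_stars, opponent_stars))

    if student_rounds > opponent_rounds:
        winner = "student"
    elif opponent_rounds > student_rounds:
        winner = "opponent"
    else:
        winner = "tie"
    return student_rounds, opponent_rounds, winner


result = play_showdown([3, 2, 2], [1, 3, 0])
```

Here the student wins the first and third rounds and the opponent wins the second, so the student is declared the winner of the text.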

Map Conquest is the other game-based method of practice 
where students generate their own self-explanations. Within 
Map Conquest, the quality of a student’s self-explanation de- 
termines the number of dice that student earns (i.e., perfor- 
mance at the target task determines the resources available 
during a subsequent game task). Students place these dice on a 
map and use them to conquer neighboring opponent territories, 
which are controlled by two virtual opponents. The surface 
components are somewhat different from those in Showdown 
but were similarly designed to provide the user with feedback 
(points, dice), incentives (dice, map puzzle), control (map puz- 
zle, production of self-explanation), and task difficulty (oppo- 
nents, text content). 

In most of the identification mini-games—for example, Bal- 
loon Bust (Figure 4)—students are presented with a target 
sentence and an example self-explanation. The student must 
decide which iSTART strategy was used in the self-explanation 
and then click on the corresponding balloons. There are also 
three other mini-games that focus on the same task of identi- 
fying strategies within example self-explanations. These other 
games each incorporate a new interface with a different com- 
bination of game elements, which might include fantasy, com- 
petition, and perceptual aspects (as in Balloon Bust). Though 
the surface features of these games can differ widely, they have 
been designed with very similar underlying mechanisms and 
can all be completed within 10-20 min. Students are allowed to
select any form of practice or mini-game from the selection 
menu that has been unlocked (provided that they have enough 
points). After completion of a task, students are directed back to 
the main iSTART-ME selection screen. 
Figure 4. Screenshot of Balloon Bust in iSTART-ME.


Current Study 


The current study was a multisession experiment in which the 
effectiveness of a game-based tutoring system (iSTART-ME) was
compared with its ITS counterpart (iSTART-Regular). One possible
concern with integrating games into learning systems is that
they have the potential to detract from the immediate pedagogical 
goals and reduce learning improvements in the short term (Jackson 
et al., 2012; Mayer & Moreno, 2003; Paas, Renkl, & Sweller, 
2003). However, across long-term training, the engagement fos- 
tered by the game environment may compensate for any distract- 
ing elements, thus allowing students to catch up in performance 
(Jackson, Dempsey, Graesser, & McNamara, 2011). Hence, this
study was conducted to thoroughly explore the potential long-term 
benefits of game-based training, how it compares with training 
from a traditional ITS, and how various effects of motivation and 
learning may unfold over time. 


Hypotheses 


The Jackson et al. (2012) study indicated that students who 
received game-based training during early stages of skill acquisi- 
tion exhibited decreased performance at the target task (compared 
with students in a traditional ITS). In contrast, the Jackson, Demp- 
sey, et al. (2011) study showed that when students completed 
initial strategy training within a traditional ITS (i.e., no game
features), subsequent performance during game and nongame 
practice methods was equivalent. Additionally, previous work with 
the game-based aspects in iSTART-ME has shown consistent 
positive effects for motivation and enjoyment (Jackson, Davis, et 
al., 2011; Jackson & McNamara, 2011). This combination of 
results leads to two hypotheses regarding the current study. 


One hypothesis is that games improve motivation and enjoy- 
ment, but they may impede learning, especially initially (Adams, 
Mayer, MacNamara, Koenig, & Wainess, 2012; Jackson et al., 
2012). In this case, we would expect the game-based environment 
to produce lower learning outcomes than the traditional ITS, 
particularly in the initial stages of learning. The second hypothesis 
is that the game-based components of iSTART-ME improve motivation
and enjoyment (Cordova & Lepper, 1996; Papastergiou,
2009), and this increase in affective measures mediates learning 
(Alexander et al., 1997). This hypothesis suggests that students in 
the game-based training should see improved motivation and en- 
joyment over time and should see a corresponding increase in 
performance during the later stages of training (compared with the 
traditional ITS). 


Procedure 


Participants and setting. Eighty-four high school students 
were recruited from the general city-wide high school population 
in an urban environment in the mid-South (51% male; 81% African
American, 13% White, 6% other; average grade completed = 10th 
grade; average age = 15.8 years). The 11-session experiment was 
conducted in a research laboratory on a large university campus 
and involved four phases: pretest, training, posttest, and retention 
test. 

Pretest. During the first session, students completed a pretest 
that included questions to collect basic demographics, prior moti- 
vation (including selected questions adapted from the Motivated 
Strategies for Learning Questionnaire, or MSLQ; Pintrich, Smith, 
Garcia, & McKeachie, 1993), and an assessment of their prior
ability to self-explain (described in more detail later). 

Training. At the beginning of each of the eight training sessions, participants
completed a 12-item daily survey (see Measures section for
details). After the daily survey, students then interacted with their
randomly assigned between-subjects condition: a game-based system
(iSTART-ME, n = 41) or a traditional ITS (iSTART-Regular,
n = 43). Students in the educational game condition
interacted with the full game-based selection menu in
iSTART-ME across eight separate sessions of at least 1 hr each.
Participants in the ITS condition used the original non-game-based 
version of iSTART for the same amount of time (eight sessions of 
at least 1 hr each). 

The initial training within both conditions was identical until the 
participants transitioned into extended practice. That is, both con- 
ditions progressed through the Introduction Module, the Demon- 
stration Module, and then two regular practice texts within the 
Coached Practice environment. Students assigned to game-based 
training were then free to use the full selection menu (Figure 2), 
while the ITS students continually transitioned from one text to 
another within the Coached Practice environment (Figure 1). 

Like many ITSs, iSTART-Regular is not completely void of 
mechanisms and features that are commonly used within games. 
For example, iSTART-Regular displayed points for each self-explanation
(near bottom-left of Figure 1), included adaptive feedback
from an animated agent, and provided a trophy (or lack
thereof) based on the performance within each text. These features 
(points, personalized feedback, animated characters, and trophies/ 
badges) are commonly used in numerous types of games and 
systems, both virtual and physical. Table 1 provides a more thor- 
ough comparison of the two training systems in terms of the key 
features included in each. iSTART-ME differed from iSTART-Regular
primarily in the presence of the selection menu, which
allowed participants to play mini-games and modify certain as- 
pects of the environment (e.g., swap tutors, personalize their 
avatar). Both systems allowed students to progress through the 
tutoring at their own pace, and therefore, not all students experi- 
enced the same components at the same time. This is a key 
characteristic of ITSs and virtually all games that adapt interac- 
tions on the basis of user decisions. Hence, some students naturally 
receive more or different kinds of training and practice than others. 

Posttest and retention. All students completed the posttest 
and then a delayed retention test (completed a week after posttest). 
The posttest consisted of assessments similar to those from the 
pretest (details are discussed in the Measures section). These 
included measures of self-explanation ability and students’ moti- 
vation during the study, along with questions pertaining to stu- 
dents’ attitudes, perceptions, and experiences. The retention test 
was used to assess the durability of students’ self-explanation
skills after a 1-week delay without training. 


Measures 


Survey and performance measures were collected during pretest, 
training, posttest, and retention. These included measures related 
to self-explanation ability as well as students’ attitudes, motiva- 
tion, self-efficacy, and enjoyment. 

Self-explanation ability. Students’ performance on self-explanation
tasks was collected during pretest, training, posttest,
and retention. During training, students interacted with various 
texts and all self-explanations were scored through the iSTART 
assessment algorithm and recorded into a database. During each of 
the three testing phases, students were presented with a new text 
(not included within training) and prompted to self-explain spe- 
cific sentences (eight self-explanations during each test). These 
three texts were selected due to their similarity in terms of length 
(281-329 words), content difficulty (Grade Level 8-9), and lin- 
guistic features (i.e., similar scores on the five principal component 
scores within Coh-Metrix; Graesser, McNamara, & Kulikowich, 
2011). Each self-explanation provided by the students was scored 
using the iSTART assessment algorithm, the performance of 
which has been shown to be comparable to that of human scorers 
(Jackson, Guess, et al., 2010). Unfortunately, due to a technical 
error, the three texts were not automatically counterbalanced 
across the testing phases. Thus, despite extensive efforts to utilize 
equitable texts, comparisons of self-explanations across time 
should be interpreted with caution and must be replicated using 
appropriate methodology. Nonetheless, the lack of counterbalanc- 
ing should not affect any comparisons between conditions. 

Attitudes, motivation, self-efficacy, and enjoyment. Survey 
questions were included during pretest, posttest, and daily training 
sessions to assess students’ attitudes, motivation, self-efficacy, and 
enjoyment. Pretest and posttest measures included several ques- 
tions adapted from the MSLQ (Pintrich et al., 1993). The questions 
adapted from the MSLQ were selected to address students’ moti- 
vation and self-efficacy. In addition to these standardized mea- 
sures, questions were included from previous research with the 
iSTART system (Jackson, Davis, et al., 2011; Jackson, Graesser, 
& McNamara, 2009; Jackson & McNamara, 2011). These addi- 
tional questions were implemented within the pretest, posttest, and 
daily surveys and were designed to measure students’ self- 
assessments of motivation, expectations for system interactions, 
current affect and mood, and overall enjoyment of the system. 


Table 1

Game Mechanism and Feature Differences Between iSTART-ME and iSTART-Regular

Mechanisms        iSTART-ME                                        iSTART-Regular

Feedback          Points, local skill bar, verbal feedback from    Points, local skill bar, verbal
                  pedagogical agent, global skill bar,             feedback from pedagogical agent
                  trophies, levels

Incentives        Points, trophies (reviewable), levels, swap      Points, trophies (viewed once
                  tutor, edit theme, edit character, play          after each text)
                  mini-game

Control           Select next activity, edit environment &         Generate self-explanations
                  characters, generate self-explanations

Task difficulty   Increased difficulty for each new level (both    New texts
                  in menu and in games), new texts

Environment       Animated pedagogical agents, select new          Animated pedagogical agents
                  animated agent, edit theme, edit character,
                  display trophy case, display performance for
                  recent texts

Table 2

Pretest and Posttest Survey Scales: Means (Standard Deviations)

Survey scales                                                  iSTART-ME     iSTART-Regular   F(1, 82)   p

Pretest survey measures
  Expected Enjoyment and Motivation to Participate (5 items)   4.73 (0.66)   4.69 (0.73)      0.07       .199
  Achievement Motivation and Learning Values* (7 items)        5.31 (0.98)   5.31 (0.86)      0.00       .996
  Self-Efficacy (3 items)                                      5.94 (1.12)   5.95 (0.91)      0.00       .962
  Competitiveness (2 items)                                    4.91 (1.08)   4.84 (1.21)      0.10       .758

Posttest survey measures
  Enjoyment and Motivation (6 items)                           4.55 (1.09)   3.83 (1.20)      8.28       .005
  Perceived Learning, Effort, and Values (9 items)             4.81 (1.56)   4.53 (1.61)      0.64       .425
  Ease of Use (4 items)                                        3.32 (1.04)   [illegible]      0.00       [illegible]

* Questions adapted from the Motivated Strategies for Learning Questionnaire (MSLQ; Pintrich, Smith, Garcia, & McKeachie, 1993).


The daily surveys used within the current study have been used 
in previous research with the game-based version of iSTART-ME 
(Jackson, Davis, et al., 2011; Jackson & McNamara, 2011). These 
surveys were designed to assess students’ moods, attitudes, and 
perceptions across the time frame of an experiment without being 
an invasive measurement during system interactions. These sur- 
veys were administered at the beginning of each training session 
and specifically addressed students’ experiences from the previous 
session (overall impression, enjoyment, boredom, frustration, in- 
terface problems, feeling of learning, feeling of improvement) and 
assessed their current attitudes and feelings (current mood, antic- 
ipation about participating, level of motivation, intention to per- 
form well, desire to do better than others). 


Results 


Training Sessions 


As mentioned previously, both systems allowed students to 
progress through training at their own speed. Despite the adaptivity 
and self-paced interactions, students’ prior ability was not related 
to the amount of practice students received during training. More 
specifically, self-explanation ability at pretest was not related to 
the number of practice texts that students completed (r = .079, p = 
.45). Therefore, initial ability levels were not related to the amount 
of extended practice that students received, and most students 
experienced the training components at approximately the same 
time. The vast majority of students completed the two regular 
practice texts and transitioned into the extended practice during the 
first (n = 6) or second (n = 72) session, while some students did 
not reach the extended practice section until the third (n = 5) or 
fourth (n = 1) session. Ultimately, all students completed the 
training modules and subsequently interacted with their randomly 
assigned training condition for the remainder of the study. 


Attitudes, Motivation, Self-Efficacy, and Enjoyment 


User experience measures from pretest questions, daily surveys, 
and posttest questions were analyzed to explore students’ attitudes, 
perceptions, and experiences within the two training systems. 
Analyses on the pretest survey questions indicate that there were 
no significant differences between conditions on questions prior to 
the start of training that related to enjoyment, motivation, self- 
efficacy, or competitiveness (see Table 2). 


The posttest survey included several questions related to enjoy- 
ment, perceived learning, and usability within the system (see Table 
2 for descriptive and analysis of variance [ANOVA] results). A 
posttest enjoyment and motivation composite score was created by 
averaging across six separate questions. An ANOVA on the enjoy- 
ment and motivation composite score yielded a significant effect of 
condition, F(1, 82) = 8.28, p = .005, mean square error (MSE) = 
1.15, Cohen’s d = 0.628. These results indicate that the game-based 
environment was rated as a significantly more positive experience 
than the traditional ITS. Additionally, a composite scale that assessed 
students’ perceived learning, effort, and values for the target system 
and materials found no significant differences between the game and 
nongame system. Likewise, a four-question scale that assessed system 
ease of use and interface confusion revealed no significant differences 
between conditions. These results suggest that the game-based selec- 
tion menu system was more enjoyable and motivating, but just as 
valuable and easy to use as the ITS. 
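
As a check on the reported effect size, Cohen's d can be recomputed from the posttest means, standard deviations, and group sizes given in Table 2 and the Procedure section. The sketch below assumes the pooled-standard-deviation form of d and the identity F = t² for a two-group comparison; the function names are illustrative, and the article does not state which formula the authors used.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardized mean difference using the pooled SD."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def f_from_summary(m1, sd1, n1, m2, sd2, n2):
    """One-way ANOVA F for two groups, computed as the squared t statistic."""
    se = pooled_sd(sd1, n1, sd2, n2) * sqrt(1 / n1 + 1 / n2)
    return ((m1 - m2) / se) ** 2


# Posttest Enjoyment and Motivation: iSTART-ME vs. iSTART-Regular
d = cohens_d(4.55, 1.09, 41, 3.83, 1.20, 43)
f = f_from_summary(4.55, 1.09, 41, 3.83, 1.20, 43)
```

With the rounded summary values, d comes out at approximately 0.63 and F at approximately 8.3, in line with the reported d = 0.628 and F(1, 82) = 8.28; small discrepancies are expected because the inputs are rounded to two decimals.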

Daily surveys were administered to assess students’ reports of 
their previous-session experiences and current-session expecta- 
tions. Questions related to similar concepts were combined into 
several composite scores (i.e., enjoyment during the previous 
session, improvements in self-efficacy, and motivation for the 
current session). A composite score was created for enjoyment 
during the previous session by combining scores from the follow- 
ing three questions: “My most recent session was . . . (very bad = 
1, very good = 6),” “I enjoyed my most recent session . . . (not at 
all = 1, very much = 6),” and “I was bored during my most recent 
session ... (reversed scored; all the time = 1, never = 6).” A 
mixed-factor ANOVA on this composite score indicated that there 
was a significant main effect for condition, such that students in
the game-based condition rated their session experiences more
favorably (M = 4.89, standard error [SE] = 0.159) than did
students in the ITS condition (M = 4.07, SE = 0.151), F(1, 76) =
13.92, p < .001, MSE = 7.51. There was also a significant linear
interaction between session and condition, F(1, 76) = 3.266, p =
.004, MSE = 0.606 (see Figure 5).¹ Pairwise comparisons using
Bonferroni adjustments for multiple tests confirmed that enjoyment
tended to increase across sessions for students interacting
with the game-based system and decrease for those in the ITS
condition. Specifically, for students in the iSTART-ME condition,
overall enjoyment at Sessions 5 (t = 4.18, p_adjusted < .01) and 8
(t = 5.03, p_adjusted < .001) was significantly higher than at
Session 1 (with the middle sessions being roughly equivalent). By
contrast, students in the ITS condition provided their highest
ratings during the first two sessions, with enjoyment at Session 2
being significantly higher than at Session 4 (t = 3.40, p_adjusted <
.05).

¹ The mixed-factor ANOVA results presented here are based on the
linear equation contrasts for the interaction. The overall within-subject
interaction effects for this mixed-factor ANOVA (including Huynh-Feldt
corrections due to significant sphericity and a large Greenhouse-Geisser
ε > .75) were also significant, F(5.99, 455.26) = 3.27, p = .004.

Figure 5. Composite means for enjoyment questions about the previous
session. ITS = intelligent tutoring system.
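
The construction of the enjoyment composite from its three daily survey items can be sketched as follows. Two points are assumptions, not statements from the article: that the composite is the mean of the items (the reported condition means fall within the 1-6 response range, which is consistent with averaging), and that the boredom item is recorded with higher values meaning more boredom and then flipped so that higher always means a more positive experience. The function names are hypothetical.

```python
def reverse_score(response, low=1, high=6):
    """Flip a Likert response so that low becomes high and vice versa."""
    if not low <= response <= high:
        raise ValueError("response outside the scale range")
    return low + high - response


def enjoyment_composite(session_rating, enjoyed, bored_raw):
    """Mean of the three enjoyment items on a 1-6 scale.

    bored_raw is assumed to be coded with higher = more bored,
    so it is reverse scored before averaging.
    """
    items = [session_rating, enjoyed, reverse_score(bored_raw)]
    return sum(items) / len(items)


score = enjoyment_composite(session_rating=5, enjoyed=6, bored_raw=2)
```

A student who rated the session 5, enjoyment 6, and raw boredom 2 would contribute item values of 5, 6, and 5 to the composite under these assumptions.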

Similarly, a composite score was created for the daily survey 
questions related to students’ improvements in self-efficacy. This 
score combined student ratings for two questions about the previ- 
ous session, “I felt like I learned the material ... (not at all = 1, 
very much = 6)” and “J feel like my reading skills improved .. . 
(not at all = 1, very much = 6).” A mixed-factor ANOVA on the 
self-efficacy composite score did not indicate a significant main 
effect for condition, F(1, 76) = 2.50, p = .118, MSE = 7.43, but 
did reveal a significant linear interaction between session and 
condition, F(1, 76) = 2.91, p = .015, MSE = 0.673 (see Figure 
6).² This interaction reflects the finding that students’ reported
self-efficacy increased across sessions if they had interacted with 
the game-based version of training and decreased if they interacted 
with the traditional ITS. Specifically, pairwise comparisons (using 
Bonferroni adjustments) showed that iSTART-ME students pro- 
vided their highest self-efficacy rating in the final session (Session 
8 was marginally higher than Session 4, t = 3.15, p_adjusted < .10).
Figure 6. Composite means for self-efficacy daily survey questions. ITS =
intelligent tutoring system.

Figure 7. Composite scores for questions about students’ motivation to
participate in the current session. ITS = intelligent tutoring system.


In contrast, iSTART-Regular students provided their highest self-efficacy
ratings in the first two sessions (Session 1 was significantly
higher than Session 7, t = 3.35, p_adjusted < .05; and Session
2 was significantly higher than both Session 4, t = 3.57, p_adjusted <
.05, and Session 7, t = 3.52, p_adjusted < .05).

Finally, a composite score was created for the daily survey 
questions that pertained to motivation to participate in current 
session: “My mood right now is ... (very negative = 1, very 
positive = 6),” “I am looking forward to participating in today’s 
session ... (not at all = 1, very much = 6),” “I am motivated to 
participate in today’s session . . . (not at all = 1, very much = 6),” 
and “I plan to do my best during today’s session . . . (not at all = 
1, very much = 6).” A mixed-factor ANOVA yielded a marginal 
main effect of condition, F(1, 75) = 3.05, p = .085, MSE = 5.82, 
indicating that students in the game-based condition (M = 5.30,
SE = 0.142) tended to be more motivated to participate than 
students interacting with the ITS (M = 4.96, SE = 0.133). This 
mixed-factor ANOVA also revealed a significant linear interaction 
between session and condition, F(1, 75) = 4.95, p = .029, MSE = 
0.410 (see Figure 7),³ reflecting the finding that students’ motivation
to participate in the current session remained stable for those
in the iSTART-ME condition but declined in the iSTART- 
Regular condition. Pairwise comparisons (using Bonferroni adjust- 
ments) confirmed that students’ ratings for today’s session within 
the game-based system were not significantly different across 
sessions (p_adjusted > .05), and students within the ITS provided
marginally higher ratings in Sessions 1 and 2, compared with
Sessions 4 (t_Session1 = 3.24, t_Session2 = 3.20, p_adjusted < .10) and 7
(t_Session2 = 3.15, p_adjusted < .10).
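
The Bonferroni adjustment behind the p_adjusted values in these pairwise comparisons can be sketched as below. The article does not report how many comparisons entered each adjustment; the example assumes, purely for illustration, that all pairs of the eight sessions were compared. The function names are hypothetical.

```python
from math import comb


def n_pairwise(k):
    """Number of distinct pairs among k sessions."""
    return comb(k, 2)


def bonferroni_adjust(p_values, n_comparisons=None):
    """Multiply each p value by the number of comparisons, capped at 1.

    If n_comparisons is omitted, the length of p_values is used.
    """
    m = len(p_values) if n_comparisons is None else n_comparisons
    return [min(1.0, p * m) for p in p_values]


m = n_pairwise(8)  # all session pairs (an assumption about the design)
adjusted = bonferroni_adjust([0.001, 0.01, 0.5], n_comparisons=m)
```

Under this assumption an unadjusted p of .001 would remain below .05 after adjustment, whereas an unadjusted p of .01 would not, which shows why only fairly large t values survive as significant in the session-by-session contrasts.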

These results collectively indicate that students provided equiv- 
alent ratings in the two conditions for the first two sessions (when 
training was the most similar), but after the game-based aspects 


² The mixed-factor ANOVA results presented here are based on the
linear equation contrasts for the interaction. The overall within-subject
interaction effects for this mixed-factor ANOVA (including Greenhouse-Geisser
corrections due to significant sphericity and a moderate
Greenhouse-Geisser ε < .70) were also significant, F(4.75, 360.84) =
2.91, p = .015.

³ The mixed-factor ANOVA results presented here are based on the
linear equation contrasts for the interaction. The overall within-subject
interaction effects for this mixed-factor ANOVA (including Huynh-Feldt
corrections due to significant sphericity and a large Greenhouse-Geisser
ε > .80) were also significant, F(6.22, 466.40) = 2.09, p = .050.
were made available, students interacting with the educational
games provided more positive ratings than did students interacting 
with the ITS. In sum, the combined evidence from the daily 
surveys and posttest questions indicates that students preferred to 
interact with the game-based system more so than the traditional 
tutoring system. 


Learning Outcomes 


Analyses were conducted on the self-explanation scores from 
the pretest, posttest, and retention test. All self-explanations were 
scored using the iSTART assessment algorithm, which has high
correspondence to human scores (k = .646; Jackson, Guess, et al., 
2010; McNamara, Boonthum, Levinstein, & Millis, 2007). As 
shown in Figure 8, self-explanation quality improved from pretest 
to posttest for students in both conditions, and this increase in 
performance was maintained in a delayed retention test 1 week 
later, but there was no benefit for either condition. Specifically, a 
mixed-factor ANOVA confirmed a main effect of test, F(1, 82) = 
22.67, p < .001, MSE = 0.20, reflecting the finding that self- 
explanation quality scores did not differ from posttest to retention 
test (t < 1), but both the posttest (t = 7.19, p < .001) and retention 
test (t = 7.77, p < .001) were significantly higher than the pretest 
(see Figure 8). There was no effect of condition, F(1, 82) = 1.61, 
p = .21, MSE = 0.71, and no interaction between condition and 
test, F(1, 82) = 0.48, p = .49, MSE = 0.17. 

One of the limitations of this study is that the self-explanation 
texts were not counterbalanced among the pretest, posttest and 
retention test phases. Therefore, the pretest to posttest improve- 
ment in self-explanation ability is conflated with a potential text 
effect. However, combining these findings with the improvement 
of self-explanation ability during training lends support to stu- 
dents’ improvement between testing phases. Additionally, the lack 
of counterbalancing does not affect the comparisons between 
conditions at each phase. Thus, the equivalent performance be- 
tween conditions at each testing phase is not confounded by text. 

We also conducted analyses to examine self-explanation perfor- 
mance comparing conditions during extended training. The first 
training sessions included the complete Introduction and Demon- 
stration modules, along with the first two texts in regular practice 
(initial ~2 hr training). On average, students began interacting 
with the two different extended practice modules during the sec- 
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Figure 8. Self-explanation performance during testing phases (means and 
standard errors). ITS = intelligent tutoring system. 
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Figure 9. Frequency of self-explanations across sessions. ITS = intelli- 
gent tutoring system. 


ond session (i.e., students started using either only coached prac- 
tice or the full selection menu during the second session). A 
mixed-factor ANOVA on the frequency of self-explanations 
yielded significant main effects for both session, F(1, 67) = 20.15, 
p < .01, MSE = 35.71, and condition, F(1, 67) = 4.03, p < .05, 
MSE = 384.86, but revealed a nonsignificant interaction between 
session and condition, F(1, 67) = 1.13, p = .29, MSE = 46.18 (see 
Figure 9 for mean frequencies across days).* Students who inter- 
acted with the ITS produced more self-explanations (M = 23.39, 
SE = 1.14) during extended practice than did students within the 
game-based training (M = 20.03, SE = 1.23), and the number of 
self-explanations tended to be highest in Session 2. Pairwise com- 
parisons (using Bonferroni adjustments) indicate that students 
across conditions generated an equivalent number of self- 
explanations during regular practice (i.e., Session 1) and that the 
frequency of self-explanations during the first session was significantly
lower than in all other sessions (all ts > 6.60, all p_adjusted < .05).
For training that took place within extended practice (i.e., Sessions
2–8), participants within the ITS produced significantly more
self-explanations than students using the game-based system during
Sessions 3 and 5 (t_Session3 = 2.16, p_adjusted = .035; t_Session5 =
2.04, p_adjusted = .046) and were marginally higher for Sessions 2 and 8
(t_Session2 = 1.97, p_adjusted = .053; t_Session8 = 1.91, p_adjusted =
.060).
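
As a brief illustration of the Bonferroni adjustment used in the pairwise
comparisons above, the sketch below multiplies each unadjusted p value by
the number of comparisons and caps the result at 1. The function name and
the input values are hypothetical; they are not taken from the study's
data or analysis scripts.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p values: min(1, p * m) for m comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical unadjusted p values for three pairwise comparisons.
raw = [0.010, 0.020, 0.400]
adjusted = bonferroni_adjust(raw)

# A comparison counts as significant only if its adjusted p is below .05.
significant = [p < 0.05 for p in adjusted]
```

With these made-up inputs, only the first comparison stays below .05 after
adjustment, which shows why an effect that looks reliable unadjusted can
become marginal once several sessions are compared.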

⁴ The mixed-factor ANOVA results presented here are based on the
linear equation contrasts for the interaction. The overall within-subject
interaction effects for this mixed-factor ANOVA (including
Greenhouse–Geisser corrections due to significant sphericity and a
moderate Greenhouse–Geisser ε < .70) were marginally significant,
F(4.62, 309.69) = 1.994, p = .085.

Figure 10. Self-explanation performance during training (means and
standard errors). ITS = intelligent tutoring system.

Further analyses examined students’ self-explanation quality as
computed by the iSTART assessment algorithm across the eight
training sessions (see Figure 10). The two main hypotheses in the
current study predicted opposite slopes for game-based performance
during the initial and later training sessions. Specifically,
the first hypothesis predicted a negative slope for game-based
performance during the early sessions, while the second hypothesis
predicted a positive slope for game-based performance in later
sessions (this predicted decrease followed by an increase was
tested through a quadratic contrast). A mixed-factor ANOVA did
not indicate a significant main effect for condition, F(1, 67) =
2.89, p > .05, MSE = 1.27; however, there was a significant
quadratic interaction between session and condition, F(1, 67) =
22.91, p < .001, MSE = 0.11.⁵ Students’ self-explanation quality
in the iSTART-ME condition tended to decrease during the initial
interactions with the game-based selection menu and educational
games (Sessions 3, 4, and 5), and Bonferroni-adjusted pairwise
comparisons indicated that scores were significantly different between
the two training conditions for Sessions 3 (t = −2.05,
p_adjusted < .05), [...], and 6 (t = −2.21,
p_adjusted < .05). These results are partially attributable to the
reduction in direct feedback from Merlin in Coached Practice. 
These trends may also be partially due to the additional cogni- 
tive tasks involved with learning the menu itself, its features, 
and various game dynamics, in addition to the targeted self- 
explanation strategies. Indeed, the analyses related to Figure 9 
demonstrate that students within the game-based system pro- 
duced fewer self-explanations and thus practiced less on the 
target task. Despite these extra features and time spent off-task 
(i.e., not practicing), the students within the game-based system 
were able to compensate for the initial deficit over time and 
ultimately rose to match the performance of the ITS partici- 
pants. It is important to note that the students within the 
game-based training were more motivated to participate, en- 
joyed interacting with the system more, and had larger improve- 
ments in self-efficacy than those students in the ITS condition, 
which would be a crucial factor in a real-life situation such as 
a classroom or practicing at home. 
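
The quadratic contrast mentioned above can be sketched as follows. This is
a minimal illustration, assuming eight equally spaced sessions and using
standard orthogonal-polynomial weights; the function names and the score
vectors are hypothetical and do not reproduce the authors' analysis, which
tested the contrast as a session-by-condition interaction.

```python
def quadratic_weights(k):
    """Quadratic orthogonal-polynomial weights for k equally spaced sessions."""
    centered = [i - (k - 1) / 2 for i in range(k)]
    squares = [x * x for x in centered]
    mean_sq = sum(squares) / k
    # Subtracting the mean makes the weights sum to zero.
    return [s - mean_sq for s in squares]


def contrast_score(session_means, weights):
    """Weighted sum of session means; positive values indicate a U shape."""
    return sum(w * m for w, m in zip(weights, session_means))


weights = quadratic_weights(8)  # eight training sessions

# Hypothetical quality scores, not the study's data.
u_shaped = [2.0, 1.9, 1.7, 1.6, 1.6, 1.7, 1.9, 2.0]
linear_rise = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]

u_score = contrast_score(u_shaped, weights)
linear_score = contrast_score(linear_rise, weights)
```

A trajectory that dips and then recovers yields a clearly positive score,
whereas a steadily rising trajectory yields a score near zero, so a
difference between conditions on this score corresponds to the kind of
quadratic interaction reported here.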

These results concur with findings in two studies conducted by 
Jackson, Dempsey, et al. (2011, 2012), collectively suggesting that 
game elements have the potential to detract from learning during 
initial skill acquisition. However, game environments can provide 
a more positive experience over time. Thus, the game-based sys- 
tem investigated in this study appears to strike an appropriate 
balance between both learning and enjoyment, improving on the 
imbalance previously encountered within a traditional tutoring 
system. This finding is especially encouraging for strategy-based 
tutors that require long-term interactions for students to develop 
skill mastery. 


Discussion 


The goal of tutoring systems and educational games is to pro- 
duce effective and enjoyable learning experiences. However, if 
students do not enjoy the experience, they are likely to cease or
avoid further interactions, which is particularly detrimental to 
systems that require skill development over longer periods of 
training. These long-term skill acquisition systems must be de- 
signed to foster significant increases in mastery development, but 
they must also be enjoyable to use. In the case of strategy tutors, 
these systems must not only teach the strategies themselves but 
provide an effective, motivating practice environment where stu- 
dents can apply this training and sufficiently develop the target 
skills into more automatic and stable processes. For example, 
previous research with iSTART has illustrated the need for pro- 
longed training of at least 5 days with the system, such that 
students (specifically those with low prior abilities) have sufficient 
opportunities to apply and master the target skills (Jackson, Boon- 
thum, et al., 2010). Thus, a game-based version of the system 
(iSTART-ME) was designed to maintain higher levels of student
motivation and engagement over an extended practice period by 
incorporating and leveraging mechanisms that positively influence 
affect (Conati, 2002; Corbett & Anderson, 2001; Cordova & 
Lepper, 1996; Graesser et al., 2009; Malone & Lepper, 1987; 
Moreno & Mayer, 2005; Shute, 2008). The current work focused 
on evaluating the global impacts of this game-based learning 
environment and comparing it to a traditional ITS. Additionally, in 
this study we investigated the specific time-based effects of these 
systems on both motivational and learning outcomes. 

Within the current study, the game-based version of training was 
preferred significantly more than the traditional tutoring system. 
The results from the posttest survey indicate that students per- 
ceived both systems to be equally helpful and easy to use but that 
the game-based system was significantly more motivating and 
enjoyable (Table 2). Likewise, results from the daily surveys 
(Figures 5–7) illustrate that students who interacted with the game-
based system tended to improve in their perception of the system 
across sessions, have improved self-efficacy (compared with those 
interacting with the ITS), and slowly increase (or at least maintain) 
motivation for future interactions. In contrast, daily ratings by 
students who interacted with the traditional tutoring system de- 
creased in enjoyment, motivation, self-efficacy, and desire for 
future interactions. The game components present within 
iSTART-ME seem to be activating related motivational constructs
that remain effective across time. These trends are also fairly 
gradual, indicating that changes may occur in smaller increments 
and slowly build up with more iterative interactions (possibly 
suggesting a cycle of affective improvement across time). 

The results for self-explanation performance (the targeted skill) 
provide a more complex message. The self-explanation frequen- 
cies and means (see Figures 9 and 10) help to provide significant 
insight into the learning trajectories comparing game-based and 
traditional tutoring systems. Students within game-based training 
generated significantly fewer self-explanations than students using 
the ITS. This difference is likely due to time spent with the 
additional nongenerative activities available within the game- 
based selection menu (i.e., mini-games, personalizing their avatar,
selecting a character, changing the interface colors, and so on).
Despite the increased number of self-explanations for the ITS
condition, both training systems produced equivalent self-
explanation performance at the posttest and delayed retention test
(Figure 8).

⁵ The mixed-factor ANOVA results presented here are based on the
contrasts for a quadratic interaction. The overall within-subject interaction
effects for this mixed-factor ANOVA (including Greenhouse–Geisser
corrections due to significant sphericity and a moderate Greenhouse–Geisser
ε < .75) were also significant, F(5.00, 335.18) = 4.23, p = .001.
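
The Greenhouse–Geisser correction reported in the footnotes rescales both
degrees of freedom of the within-subject test by the estimated ε. A minimal
sketch follows; the function name, the sample size of 69 (inferred here from
the reported error df of 67 with two conditions), and the ε value of .66 are
assumptions for illustration, not values stated by the authors.

```python
def gg_corrected_df(epsilon, n_levels, n_subjects, n_groups):
    """Scale the interaction and error df of a mixed ANOVA by epsilon."""
    # Uncorrected df for the within-by-between interaction and its error term.
    df_effect = (n_levels - 1) * (n_groups - 1)
    df_error = (n_levels - 1) * (n_subjects - n_groups)
    return epsilon * df_effect, epsilon * df_error


# Assumed inputs: 8 sessions, 2 conditions, 69 students, epsilon = .66.
df_effect, df_error = gg_corrected_df(0.66, n_levels=8, n_subjects=69, n_groups=2)
```

Under these assumptions the uncorrected df of 7 and 469 shrink to roughly
4.62 and 309.5, which is close to the corrected df reported for the
frequency analysis.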

Based on the analyses for Figure 10, it appears that the traditional
ITS showed a predominantly positive relation between
the amount of training and performance. This trend was
expected, based on past research substantiating the positive bene- 
fits of iSTART training over time (Jackson, Boonthum, et al., 
2010; McNamara, O’Reilly, Rowe, Boonthum, & Levinstein, 
2007). The trajectories for self-explanation quality within the 
game-based system allow us to address the primary hypotheses. 
Our two main hypotheses regarded the potential benefits or hin- 
drances from the game-based version of training. Specifically, the 
first hypothesis predicted that the addition of game-based features 
may detract from the learning objectives, such that students should 
exhibit a decrease in performance during the initial stages of 
training. Indeed, the decrease in performance during Sessions 3 
through 5 (Figure 10) suggests that the game-based features may
initially detract or interfere with students’ ability to apply the 
target strategies (possibly due to competing stimuli and accommo- 
dating multiple goals). 

The second hypothesis predicted that game-based features 
should improve motivation and engagement during prolonged pe- 
riods of training, which should have a corresponding increase in 
applied mastery (i.e., increased performance during later practice). 
The increase in self-explanation quality across Sessions 6 through 
8 (Figure 10) lends support for this second hypothesis. Specifi- 
cally, the increase in performance during these sessions corre- 
sponds with the improved affect and motivation ratings in the 
game-based condition (Figures 5-7). 

Analyses on the self-explanation quality across time indicated 
that the game-based system resulted in a significant quadratic 
relation between training and performance, such that performance 
initially declined and subsequently increased. This curvilinear 
performance trajectory provides statistical support for both hy- 
potheses and may also help to explain some of the mixed results 
found in the previous literature on educational games. Specifically, 
the time scale of measurement within a study may determine 
whether performance trends for game-based systems appear to be 
positive, negative, or neutral. 

It is also worth noting that the minimal game features remaining 
in the ITS (see Table 1) were not enough to produce the same 
motivational improvements as the fully game-based version of 
training. This finding is potentially significant for two reasons. 
First, the fully game-based training likely would have produced 
even larger motivational differences if it had been compared with 
a more stripped-down version of an ITS (i.e., exaggerating the 
already significant effects). Second, just adding in a few game-like 
features to an ITS is not enough to produce the effects found in 
more coherent and contextually bound educational games. Our 
findings in this study demonstrate that the combined set of features 
and mechanisms integrated within our game-based system (feed- 
back, incentives, control, task difficulty, and environment) effec- 
tively enhanced users’ experience with the tutoring system and that 
most of these benefits tended to remain stable or even increase 
across time. The overall findings indicate that game-based interaction
mechanisms can provide enjoyable, effective interactions
that promote sustained motivation and mastery over time. 

The current results further suggest that future research on edu- 
cational games should incorporate multiple time scales of mea- 
surement to investigate the complex trajectories of both learning 
and motivation within these environments. As suggested from the 
current work, along with results from Jackson, Dempsey, et al. 
(2011, 2012), isolated measurements solely at pretest and posttest 
may provide an oversimplified snapshot of the potential benefits 
(or weaknesses) of game-based education. The current work is 
intended to inform researchers’ future development and evalua- 
tion, as well as contribute to the need for empirical comparisons 
between game-based and nongame-based tutoring environments 
(O’Neil & Fisher, 2004; O’Neil et al., 2005).

The outcomes and concepts discussed here provide unique in- 
sight into various time-based effects within educational games. 
Repeated observations allow us to represent students’ experiences 
throughout the interaction process and are further supported with 
more summative measures collected separately from training (i.e.,
posttest and delayed retention). The results from this study as well 
as our previous studies (Jackson, Dempsey, et al., 2011, 2012) 
support the assumption that students prefer working with game- 
based tutoring environments and that, over time, these systems can 
provide enjoyable training that produces learning outcomes com- 
parable to more traditional ITSs. The current work provides sub- 
stantial support for incorporating games into long-term tutoring 
environments and should help researchers and educators to better 
understand the potential benefits from these game-based compo- 
nents and systems. 
The present research examined how mode of play in an educational mathematics video game impacts
learning, performance, and motivation. The game was designed for the practice and automation of
arithmetic skills to increase fluency and was adapted to allow for individual, competitive, or collaborative
game play. Participants (N = 58) from urban middle schools were randomly assigned to each
experimental condition. Results suggested that, in comparison to individual play, competition increased
in-game learning, whereas collaboration decreased performance during the experimental play session.
Although out-of-game math fluency improved overall, it did not vary by condition. Furthermore,
competition and collaboration elicited greater situational interest and enjoyment and invoked a stronger
mastery goal orientation. Additionally, collaboration resulted in stronger intentions to play the game
again and to recommend it to others. Results are discussed in terms of the potential for mathematics
learning games and technology to increase student learning and motivation and to demonstrate how
different modes of engagement can inform the instructional design of such games.


Keywords: achievement goal orientations, games, learning 


The past decade has seen an intensifying interest in the use of 
digital games in pursuit of educational goals. Entertainment 
games, whether they run on a computer, game console, mobile 
device, or touch pad, are highly engaging and motivating, and 
educators have suggested taking advantage of these qualities of 
games to facilitate learning (Gee, 2007; Kafai, 1995; Squire,
2003). Proponents of digital game-based learning have argued that 
well-designed games embody educational and learning theory and 
are in line with some of the “best practices” of education (e.g., 
Barab, Ingram-Goble, & Warren, 2008; Collins & Halverson, 
2009; Gee, 2003; Mayo, 2007; Shaffer, 2008; Squire, 2008). 

The validation of claims that good games are good for learning 
first leads us to consider the question: What is a good game? There 
are many aspects of the design of a digital game that can impact 
the game’s educational effectiveness. For example, game design- 
ers make decisions regarding the game’s core mechanic (Plass et 
al., in press; Salen & Zimmerman, 2003), the representation of the 
game content (Plass et al., 2009), the emotional design of the game 
(Um, Plass, Hayward, & Homer, 2012), the game’s incentive 
system, and social aspects of play (Salen & Zimmerman, 2003). 
Research on the design of good games for learning therefore 
examines the effects of key features of games on students’ learning 
experiences and outcomes (Plass, Homer, & Hayward, 2009). The 
goal of this line of research is to investigate whether effects of 
social, cognitive, and affective factors related to learning found in 
research on other learning environments can be extended to the 
design of games for learning and used to develop theory-based, 
empirically validated design patterns for such games. Design pat- 
terns, originally proposed in the context of architecture (Alexan- 
der, Ishikawa, & Silverstein, 1977), represent general solutions to 
commonly occurring problems that educational game designers 
can use to guide the design of specific aspects of their games. 

In the present study, we examined one of these design patterns, the
context of playing a game, to increase arithmetic fluency.
Middle-school students were randomly assigned to play an
arithmetic game, FactorReactor, developed by the Games for 
Learning Institute for the purpose of this research. They played 
either on their own (individual), against another student (compet- 
itive), or together with another student (collaborative). Learning, 
performance, achievement goal orientations, interest, enjoyment, 
and future game intentions were examined as a function of mode 
of play. 


Theoretical Background 


In the present research, we were interested in how three modes 
of play (individual, competitive, and collaborative) affect learning, 
game performance, and motivation. The conceptual framework for 
this research consists of the educational context of learning, 
achievement goal theory, and interest, for which we review related
research in this section. 


Educational Contexts 


It has long been established that social context generally, and 
peer interaction specifically, impact the learning process and that 
knowledge construction is a social, collaborative process (Light & 
Littleton, 1999; Piaget, 1932; Salomon, 1993; Scardamalia & 
Bereiter, 1991; Vygotsky, 1978). Research on the social context of 
learning has found that peer involvement in learning can affect 
both academic achievement and learner attitudes in a variety
of contexts. Early work on cooperative learning in the classroom 
context suggests that peer collaboration may have positive effects 
on academic achievement across a variety of content areas (Berg, 
1994; Dillenbourg, 1999; Slavin, 1980, 1983; Slavin, Leavey, & 
Madden, 1984). Cooperative learning has also been found to 
increase positive attitudes toward school generally and mathemat- 
ics as a subject area (Slavin, 1980; Slavin et al., 1984). Research 
on competition suggests that learning and performance are better 
in competitive compared with individual settings (Ames, 1984) 
and that competitive features result in the development of analytic 
skills (Fu, Wu, & Ho, 2009), but not always in increased learning 
outcomes (Ke & Grabowski, 2007). 


Collaboration 


Group collaboration can take a variety of forms and has been 
investigated in a broad range of contexts, including classroom- 
based learning (Berg, 1994), computer-based learning (R. T. John- 
son, Johnson, & Stanne, 1986; Mevarech, Stern, & Levita, 1987; 
Scardamalia & Bereiter, 1991), and web-based and e-learning 
(Hron & Friedrich, 2003). What these collaborations have in 
common is that two or more learners interact in a synchronous 
form to negotiate shared meaning and jointly and continuously 
solve problems (Dillenbourg, 1999). 

The recent surge in interest in digital games as tools for learning 
offers up a new forum for investigating learning as a social 
activity. Initial research has provided thick descriptions and case 
studies of such collaborative activities in learning with games and 
related activities (Barab, Thomas, Dodge, Carteaux, & Tuzun, 
2005; Squire, 2005; Steinkuehler, 2006). In comparison to indi- 
vidual study, group collaboration appears to be well suited for 
problem solving because collaboration encourages students to explain
their thinking, verbalize it, and engage in joint elaboration on
their decision making (Mullins, Rummel, & Spada, 2011). In 
addition, Kirschner, Paas, Kirschner, and Janssen (2011) showed 
that students working in small groups were better able to handle 
the cognitive load demands of problems with complex informa- 
tion, and thus learned more efficiently, than students solving 
problems in individual work. 

The effects of collaboration only accrue, however, when certain 
conditions are met. In fact, a meta-analysis by Lou et al. (1996) 
found that collaboration did not have an effect in about one fourth 
of the studies, and in some cases even had a negative impact. Some 
of the conditions for the effectiveness of group collaboration are so 
fundamental that many consider them part of the definition of 
collaborative learning: Group members must have a shared group 
goal that they deem important, and the success of the activity must 
depend on all members of the group; that is, each member must be 
individually accountable (Slavin, 1988). 

In addition to these fundamental conditions, additional ways to 
support group collaboration have been explored. Berg (1994), for 
example, used collaboration scripts to facilitate group collabora- 
tion, and Hron, Hesse, Cress, and Giovis (2000) showed that 
structuring the dialogue in group collaboration enhanced learners’ 
orientation to the subject matter and reduced off-task conversation, 
though it did not increase knowledge gains. Other ways to assure 
the success of collaborative learning include providing students
with visualization tools (Fischer, Bruhn, Grasel, & Mandl, 2002), 
managing the cognitive load they experience (van Bruggen, 
Kirschner, & Jochems, 2002), and providing adaptive support from 
intelligent tutors (Diziol, Walker, Rummel, & Koedinger, 2010) 
and from interactive dialogue agents (Chaudhuri et al., 2008). 

The beneficial performance effects of collaboration only appear 
to be present for tasks involving conceptual knowledge, but not for 
procedural skill fluency (Mullins et al., 2011). In their research, 
Mullins et al. (2011) found that collaboration improved learning 
for both conceptual and procedural (skill fluency) material but that 
students in the procedural skill task engaged in ineffective learning 
behaviors. This is supported by other studies of group collabora- 
tion on learning involving conceptual knowledge that found that 
students provide explanations to one another (Diziol, Rummel, 
Spada, & McLaren, 2007) and engage in joint elaboration and 
co-construction of knowledge (Berg, 1994). The same kind of 
elaboration was not found in procedural skills acquisition. 

In the present study, we were interested in investigating collab- 
oration on a game-based task of arithmetic fluency development. 
Even though research so far has not shown clear benefits of 
collaboration for skills automation, other research suggests that 
conceptual knowledge and skills acquisition are linked, and the 
development of one can benefit the other (Rittle-Johnson & Ali- 
bali, 1999; Rittle-Johnson, Siegler, & Alibali, 2001). 

Arithmetic skills development begins in early childhood and 
continues throughout formal and informal schooling with the goal 
of becoming automated, but even adults often still use strategies to 
solve basic problems of addition, subtraction, multiplication, and 
division rather than retrieving basic arithmetic facts from long- 
term memory (Tronsky, 2005). The adaptive strategy choice model 
developed by Siegler and colleagues (Lemaire & Siegler, 1995; 
Shrager & Siegler, 1998) describes the development of strategy 
use along four dimensions as arithmetic experience increases. 
These dimensions include (a) which strategies are available to the 
learner, (b) when a particular strategy is used, (c) how that strategy
is executed, and (d) the decisions governing which strategy is
chosen. As learners encounter arithmetic problems, they select and 
carry out a strategy to solve the problem and, in the process, 
accumulate data on the effectiveness of their strategy on multiple 
levels (Shrager & Siegler, 1998). Research has shown that, over 
time, automation (i.e., retrieval of the correct answer from mem- 
ory) becomes the dominant strategy because it yields highest 
accuracy rates and shortest response times (Tronsky, 2005). 

FactorReactor was designed to support this skill automation in 
middle-school-age children by providing arithmetic problems that 
increase in difficulty from one level to the next. Small-group 
collaboration was found to be beneficial in the classroom even for 
the development of arithmetic skills (Yackel, Cobb, & Wood, 
1991), and we were interested in whether a collaborative mode in 
the game would result in higher performance compared with an 
individual play mode. 


Competition 


A common element of video games is a competitive mode in 
which players compete with one another. In some cases, this 
competition means that two or more players compete for the same 
goal, such as in the table tennis game in Wii Sports Resort. In other 
cases, both players play the same game individually but are aware 
of each other’s progress and score, such as in the bowling game in 
Wii Sports Resort. 

Many studies investigating the effect of competitive forms of 
learning compare various social modes with an individual mode. A 
meta-analysis of 122 studies, comparing the effects of individual, 
competitive, and collaborative goal structures on achievement, 
found benefits for collaborative compared with competitive or 
individual goal structures (D. W. Johnson, Johnson, Maruyama, 
Nelson, & Skon, 1981). In a related study, R. T. Johnson, Johnson, 
and Stanne (1986) compared the effect of computer-assisted co- 
operative, competitive, and individual learning on performance 
and attitudes. Eighth graders were randomly assigned to work in 
either a small group, in the cooperative and competitive condi- 
tions, or individually to learn about fundamentals of map reading 
and navigation. Students in the cooperative condition were found 
to show the highest performance on daily worksheets. However, 
both the cooperative and competitive groups had higher levels of 
interest in computers at the close of the study, as compared with 
those who worked individually. More recently, Fu et al. (2009) 
investigated the knowledge creation process in a web-based learn- 
ing environment concerning computer software. The authors pre- 
dicted that the social presence of peers, in the form of a partner, 
would increase performance as well as enjoyment motivation. Four 
conditions were compared in which the collaborative (presence vs. 
absence of a partner) and competitive (presence vs. absence of 
financial reward and grade feedback) features of group learning of 
undergraduate students were systematically varied. Results indi- 
cate that both the collaborative and competitive features increased 
enjoyment in learning. When competitive features were present, 
students demonstrated higher analytic skills, or the separation of 
concepts into component parts as a means to understand organi- 
zational structure, as defined by Bloom’s (1956) taxonomy. The 
collaborative feature encouraged higher synthetic skills, or the 
building of structure from information, and therefore was indicative
of higher level learning. The authors concluded that both
collaborative and competitive elements worked to bolster perfor- 
mance in a web-based environment. 

Strommen (1993) compared cooperative and competitive con- 
texts in learning from a computer-based natural science game 
among fourth graders. Students in the collaborative condition were 
found to be more successful in their game performance and used 
more game play strategies as compared with those in the compet- 
itive condition. Ke and Grabowski (2007) used a math computer 
game addressing measurement, whole numbers, equations, and 
graphing to examine the impact of cooperative game play, indi- 
vidual play, and competitive game play in fifth graders. After eight 
40-min game play sessions, there was no difference between the 
cooperative and competitive conditions in achievement, as measured
by a multiple-choice arithmetic test. However, students in the
cooperative condition demonstrated more positive math attitudes 
at the close of the study as compared with those in the competitive 
condition, further suggesting that the presence of peers when 
learning impacts attitudes toward academic content. 

Because competition is a common element of games, and be- 
cause some research suggests that performance is better in com- 
petitive compared with individual settings (Ames, 1984), we were 
interested in how learning and performance in the competitive play 
version of FactorReactor compared with individual play. 


Achievement Goal Orientations 


The structure of learning environments and the tasks used to 
engage learners can elicit particular achievement goals that can 
either facilitate or hinder learning (Ames, 1992; Meece, Ander- 
man, & Anderman, 2006). Similarly, modes of play may influence 
the adoption of particular goal orientations. Achievement goal 
theory posits two major types of goal orientations people endorse 
in achievement situations: mastery and performance (Ames & 
Archer, 1988; Dweck & Leggett, 1988; Elliot, 2005). A mastery 
goal orientation focuses on learning and the development of abil- 
ities, and success is defined in terms of personal improvement. In 
contrast, performance goal orientations focus on demonstrating or 
validating abilities, and success is defined in terms of performing 
well compared with others (Elliot, 1999, 2005). It is distinct from 
competition, however, in that outperforming others is a means of 
demonstrating or validating abilities rather than being the goal in 
and of itself. Performance goals can further be subdivided into 
approach and avoidance dimensions (Cury, Elliot, Da Fonseca, & 
Moller, 2006; Elliot, 2005; Elliot & McGregor, 2001). A 
performance-approach goal orientation focuses on performing well 
compared with others, whereas a performance-avoidance goal 
orientation is concerned with evading the appearance of incompe- 
tence and performing poorly relative to others. This approach- 
avoidance distinction has also been made with regard to mastery 
goals (Elliot, 1999; Elliot & McGregor, 2001); however, there is 
less empirical support for it (Maehr & Zusho, 2009). Therefore, we 
used the trichotomous model in the present research, assessing 
mastery-approach (which we refer to as mastery), performance- 
approach, and performance-avoidance goals among learners. 

In general, research has found that mastery goal orientations 
result in highly adaptive patterns of motivation and learning 
(Midgley, Kaplan, & Middleton, 2001). For example, they are 
associated with high levels of effort and persistence (Grant & 
Dweck, 2003), particularly on difficult tasks (Elliott & Dweck,
1988; Stipek & Kowalski, 1989), increased task involvement (Har- 
ackiewicz, Barron, Tauer, Carter, & Elliot, 2000), and increased 
self-efficacy (Meece, Blumenfeld, & Hoyle, 1988; Midgley et al., 
1998). Moreover, mastery goal orientations are associated with 
enhanced learning strategies that lead to better understanding of 
concepts and recall (Ames & Archer, 1988; Elliot & McGregor, 
2001; Grant & Dweck, 2003). Although performance-approach 
goals can also have adaptive outcomes, such as high academic 
achievement (Harackiewicz, Barron, Pintrich, Elliot, & Thrash, 
2002), these benefits can be accompanied by test anxiety (Linnen- 
brink, 2005; Skaalvik, 1997), cheating (Tas & Tekkaya, 2010), and 
the avoidance of help seeking (Karabenick, 2004). In contrast, 
performance-avoidance goals are consistently found to result in 
maladaptive motivational outcomes (Elliot & Mapes, 2005; Har- 
ackiewicz et al., 2002; Midgley et al., 2001). They are associated 
with lower achievement, intrinsic motivation, academic self- 
efficacy, and engagement (e.g., Church, Elliot, & Gable, 2001; 
Elliot & McGregor, 1999; Middleton & Midgley, 1997; Pekrun, 
Elliot, & Maier, 2009; Skaalvik, 1997). 

Taken together, mastery goal orientations provide the most 
adaptive framework from which to pursue educational goals, and 
contexts structured to invoke these goals have the potential to 
benefit student motivation in the long run. For example, O’Keefe, 
Ben-Eliyahu, and Linnenbrink-Garcia (2013) found that a mastery- 
structured learning environment not only attenuated students’ 
performance-approach and -avoidance goal orientations but also 
augmented mastery goal orientations. Furthermore, the observed 
increases in mastery goal orientations were sustained 6 months 
after students had returned to more traditional, performance- 
oriented learning environments. 

In the present research, we examined how playing an educa- 
tional game by oneself, in competition with another, or collabora- 
tively results in the adoption of various achievement goal orien- 
tations. Given their influence on the adaptiveness of motivational 
and learning patterns, in the present study we intended to shed light 
on how the design and implementation of educational games can 
result in optimal motivational outcomes. A study by Ames (1984) 
found that working individually on a set of puzzles led children to 
attribute their level of performance to the effort they had expended, 
whereas those working competitively attributed their performance 
to their level of ability. Given that these attributional patterns map 
onto mastery and performance goal orientations, respectively 
(Dweck, 1986; Dweck & Leggett, 1988), we might expect that 
performance goal orientations would be adopted more strongly in 
the competitive condition relative to the individual play condition. 
The context of a game, however, may change the meaning of 
competition. Although games can heighten concerns about perfor- 
mance, they do not necessarily heighten concerns about the dem- 
onstration or validation of normative ability. Instead, educational 
games, such as the one employed in the present research, are 
designed to produce incremental personal success, which is in line 
with a mastery goal orientation. Therefore, we expected that play- 
ing competitively would increase mastery goal orientations as 
compared to individual play and that performance goal orienta- 
tions would not be affected by the competitive game context. 

We expected the collaborative condition to have a similar effect 
on players’ mastery goal orientations. A study by Ames and Felker 
(1979) examining children’s attributions regarding the achievement 
outcomes of another student found that ability attributions 
were stronger for those who worked individually and competi- 
tively than collaboratively. Similarly, effort attributions were 
stronger for individual and competitive work than successful (as 
compared with unsuccessful) collaborations. These results would 
suggest an ambiguous prediction regarding the adoption of 
achievement goal orientations in the types of contexts that are 
traditionally examined. However, the context of an educational 
game is different from the nongame contexts that are typically 
studied, largely because it provides a framework for incremental 
personal improvement. Accordingly, we expected that collabora- 
tive play would invoke stronger mastery goal orientations com- 
pared to individual play, and that performance goal orientations 
would remain unaffected. 


Interest 


Mode of play should similarly have an impact on players’ 
interest in the game. First, it is useful to distinguish two general 
types of interest. Individual interest refers to an intrinsic desire and 
tendency to engage in particular ideas, content, and activities over 
time. For example, someone with an individual interest in sports 
may watch games on television, read up on player stats, or play in 
a competitive athletic league, and engage in these activities on a 
relatively regular basis. Situational interest, in contrast, refers to 
the attentional and affective reactions elicited by the environment 
(e.g., Hidi & Renninger, 2006; Linnenbrink-Garcia et al., 2010). 
For instance, a physics instructor explaining how rockets work 
may not elicit much situational interest in his or her students using 
traditional lecture methods; however, he or she would likely elicit 
high situational interest by having students build and launch their 
own rockets. Although situational interest involves elements that 
include feelings of excitement and fascination, it is distinct from 
other constructs, such as enjoyment, in that it also includes ele- 
ments relating to the personal value of the interest object or 
involvement in the activity. 

Situational interest is of particular importance in education 
because it is essential to the development of individual interest. 
According to Hidi and Renninger’s (2006) four-phase model, once 
situational interest is triggered, it can be maintained when personal 
relevance or involvement is established. Individual interest begins 
to emerge when the individual develops a relatively persistent 
predisposition to reengage in particular ideas, content, or activities. 
Finally, well-developed individual interest emerges once contex- 
tual supports are no longer necessary, such that the interest is 
generally, but not exclusively, self-generated. In the present study, 
we were interested in how the modes of play, particularly competitive 
and collaborative play, influence situational interest, as this may 
suggest how games for learning can be designed and implemented 
to effectively elicit situational interest that ultimately develops into 
individual interest in academic topics. 

Although one of the defining characteristics of games is to elicit 
situational interest (Salen & Zimmerman, 2003), the extent to 
which individual, competitive, and collaborative modes of play 
contribute to its invocation has not yet been examined experimen- 
tally. We expected that competitive and collaborative play modes 
would elicit greater situational interest than playing alone due to 
the social aspect of playing against or with a partner. These social 
contexts should enhance the excitement of game play, as well as 
personal involvement. Additionally, we expected that other indicators 
of interest and motivation would reflect this prediction, such 
that the competitive and collaborative conditions should lead to 
greater enjoyment of the game, as well as a greater likelihood of 
future game reengagement and recommending the game to others. 


The Present Study 


In the present study, we aimed to investigate how three modes 
of play (individual, competitive, and collaborative) affect learning, 
game performance, and motivation. As discussed above, social 
educational contexts, such as competition and collaboration, have 
been shown to affect learning in a variety of settings, such as 
classrooms and web-based environments, for different age groups, 
and for different levels of learning objectives. It is of great interest 
to game designers and motivation theorists alike whether similar 
effects can be found for digital games designed for educational 
purposes. We, therefore, investigated how competitive and collab- 
orative modes of play compared with individual play in impacting 
learning, performance, achievement goal orientations, situational 
interest, enjoyment, and intentions to reengage in the game and 
recommend it to others. Our focus on these outcomes reflects the 
main purposes of using games in education, namely their 
potential to improve performance and increase 
engagement in educational activities. For the present research, we 
used FactorReactor, a game designed to practice and automate 
arithmetic skills to increase arithmetic fluency in middle-school- 
age students. 


Method 


Participants. Participants were 58 sixth-, seventh-, and 
eighth-grade students (58.6% female) from seven urban public 
schools in a major northeastern city. All students were taking part 
in a technology-themed afterschool program led by a teacher at 
their school. Membership in each of the programs was small and 
voluntary. In partnership with these programs, researchers made 
weekly visits to each school during the academic year to introduce 
students to educational technologies and games. In one of the 
sessions, students participated in the present study. The mean 
age of the students was 11.02 years (SD = 3.61). Missing data 
were handled through listwise deletion. 

Procedure. Before students arrived at the classroom in which 
the study was run, tables were arranged so that computer stations 
could be set up sufficiently far apart from one another. When 
students arrived, their assent and parental consent were collected, 
and they were then seated at a computer station. They first watched an 
instructional video on their computers that provided an overview 
of the rules and goals of the game, FactorReactor, as well as how 
to use the Xbox game controller. Computer monitors were either 
13 or 15 in. (33 or 38.1 cm). All participants then played a practice 
round of the game individually for 5 min. During this time, they 
were provided with a controller schematic sheet to assist in learn- 
ing the operation of the controller. At the end of the practice 
session, an experimenter was available to the students to clarify 
any issues regarding the game, and the controller schematic sheet 
was taken away. Next, all participants played the game independently 
for 3 min, which constituted the pretest of game performance. 

Students were then randomly assigned to one of three modes of 
play: individual (n = 16), competitive (n = 20), and collaborative 
(n = 22). This also meant that partners in the competitive and 
collaborative conditions were random. Participants in the individual 
condition were situated in front of a laptop computer with a 
single controller, whereas those in the competitive and collaborative 
conditions were seated with a partner in front of a laptop computer with two 
controllers. Before beginning, an experimenter provided the context 
for the experimental game play and specific instructions to the 
students. Those in the individual condition were told that they 
would be playing the same version of the game as before and were 
given the following instructions: “When playing the game, get the 
best score you can.” Those in the competitive condition were told 
that they would be playing a version of the game that allowed two 
players to compete against each other and were given the follow- 
ing instructions: “When playing the game, compete against each 
other for the better score.” Those in the collaborative condition 
were told that they would be playing a version of the game that 
allowed two players to play together and were given the following 
instructions: “When playing the game, work together to get the 
best score.” Instructions to learners regarding how to collaborate 
were kept relatively short for three reasons: First, middle-school- 
age students are used to playing games without receiving elaborate 
instructions and would likely have skipped any instructions pro- 
vided to them. Second, models of mathematics learning describe 
students as active learners who spontaneously create their own 
strategies to solve a problem (Cobb, Wood, & Yackel, 1991), and 
we did not want to stifle this invention of strategies by prescribing 
the process of collaboration. Finally, critical reviews of studies 
involving various forms of scaffolding have argued that perfor- 
mance differences between the individual and collaborative group 
found in such studies could have been attributable to the fact that 
the scaffolding (elaboration scripts, dialogue scaffolding, visual- 
izations) was only given to the collaborative group (Mullins, 
Rummel, & Spada, 2010). 

Participants were given 15 min to play, at which point the game 
automatically stopped. Figure 1 shows screen shots of the game in 
the three play modes. Participants then played another 3-min 
individual play session as a posttest of game performance. At the 
end of game play, participants were independently administered 
surveys assessing game-relevant achievement goal orientations, 
situational interest in the game, game enjoyment, future intentions 
regarding the game, and their degree of experience with video 
game controllers. Finally, they completed another individual 3-min 
play session. 


Materials 


FactorReactor. FactorReactor is a game designed to practice 
and automate arithmetic skills, and was adapted from the original 
version to investigate cognitive and motivational outcomes related 
to mode of play. The game runs on a PC and is played with an 
Xbox controller connected to the PC via USB cable. Figure 1 
shows screen shots of the game for each mode of play. Arithmetic 
fluency was chosen because it was identified by many teachers in 
our collaborating middle schools as a key skill on which other 
skills from the Common Core standards in Grades 6–8 build, but 
which is not sufficiently developed in many middle-school students. 

Figure 1. Three modes of play in FactorReactor: A: individual play. B: competitive play. C: collaborative 
play. FactorReactor by Murphy Stein and Games for Learning Institute is licensed under a Creative Commons 
Attribution-NonCommercial-ShareAlike 3.0 unported license. 


FactorReactor possesses the key defining elements of a game: It 
has a clear goal and clear rules of play, has an engaging game 
mechanic that allows for a high degree of player choice, provides 
feedback and incentives, and has a fail state (Salen & Zimmerman, 
2003). The object of the game was to transform the center number 
into one of the surrounding goal numbers by adding, subtracting, 
multiplying, or dividing it by one of the numbers from the inner 
ring. This was conducted by selecting one of the operators (+, −, 
×, ÷) and one of the inner numbers and then hitting the fire 
button. For example, if the goal number was 7 and the center 
number was 2, as it is in Figure 1A, the player might select 5 from 
the inner ring, and then choose the “+” operator. By pressing the 
fire button, 5 would be added to the center number, transforming 
it to 7. A subsequent press of the fire button then solved the 
problem and automatically advanced to the next goal number. 
When transformations were done correctly, the center ring turned 
from red to green. If incorrect or impossible transformations were 
attempted (e.g., dividing the center number so that it does not 
result in a whole number, such as 2 ÷ 5), the center number would 
temporarily glow and jiggle. Players had full control over which 
goal number they worked on at any given time, however, affording 
considerable flexibility in solving each problem. 

Each time the center number was correctly transformed, the 
player earned a token, called a “ring,” and each player began the 
game with 10 rings. The number of rings rewarded for a correct 
transformation was equal to 2 times the minimum possible number 
of transformations for the relevant solution. Players could be 
awarded between two and eight rings, depending on the problem 
that was solved (i.e., problems required, at a minimum, between 
one and four steps to be solved). Rings are used up with each 
operation, such that when a player hits the fire button, a ring is 
used; therefore, if a player attempts to transform the center number 
using multiple operations, multiple rings are used. In this way, the 
game disincentivized players from reaching their goal by guessing 
or repeating the same simple operation again and again (e.g., 
repeatedly subtracting a small number) and encouraged them to 
use more complex operations to solve problems in fewer moves. 
Scores were also calculated and were highest for those who 
solved each problem using the fewest rings. The level 
ended when all goal numbers were computed properly and at least 
one ring remained. 
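
The transformation and reward rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the game's actual implementation: the function names (`apply_op`, `min_transformations`, `ring_reward`) are hypothetical, inner-ring numbers are assumed to be reusable, negative intermediate values are assumed to be allowed, and the search is capped at four steps because the text states that problems required between one and four steps.

```python
from collections import deque


def apply_op(center, op, n):
    """Return the transformed center number, or None if the move is rejected."""
    if op == "+":
        return center + n
    if op == "-":
        return center - n
    if op == "*":
        return center * n
    if op == "/":
        # Division must yield a whole number (e.g., 2 / 5 is rejected).
        if n == 0 or center % n != 0:
            return None
        return center // n
    raise ValueError(f"unknown operator: {op}")


def min_transformations(center, goal, inner_ring, max_steps=4):
    """Breadth-first search for the fewest moves that turn center into goal."""
    if center == goal:
        return 0
    frontier = deque([(center, 0)])
    seen = {center}
    while frontier:
        value, steps = frontier.popleft()
        if steps == max_steps:
            continue  # do not search beyond the assumed step cap
        for op in "+-*/":
            for n in inner_ring:
                nxt = apply_op(value, op, n)
                if nxt is None or nxt in seen:
                    continue
                if nxt == goal:
                    return steps + 1
                seen.add(nxt)
                frontier.append((nxt, steps + 1))
    return None  # goal not reachable within max_steps


def ring_reward(center, goal, inner_ring):
    """Rings awarded: 2 times the minimum number of transformations."""
    steps = min_transformations(center, goal, inner_ring)
    return None if steps is None else 2 * steps


# Example from the text: center 2, goal 7, with 5 on a hypothetical inner ring.
print(ring_reward(2, 7, [5, 3, 4]))  # one step (2 + 5), so 2 rings
```

Because every press of the fire button costs one ring while the reward depends only on the shortest possible solution, a player who solves a problem in the minimum number of moves comes out ahead, whereas guessing or repeating small operations drains rings.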

Levels increased in difficulty, such that the operations needed to 
reach the goal number became more complex. For example, a 
player may not be able to simply subtract an inner-ring number to 
successfully transform the center number as in easier levels. They 
may instead need to divide by one inner-ring number and then add 
another inner-ring number, or perhaps a more complex series of 
transformations. When a player ran out of rings, they received a 
“Game Over” message and were required to start the current level 
from the beginning. These messages were therefore an indicator of 
the use of inefficient strategies used by the players to solve the 
arithmetic problems presented by the game. 

The game screens for the individual and collaborative play conditions 
were nearly identical (see Figure 1A and 1C). They had one 
game interface, which included one center number, five inner-ring 
numbers, and five goal numbers. The only difference between the 
two was that, in the collaborative condition, both players had 
simultaneous and independent control over the game operations. 
That is, each player could select operators, inner-ring numbers, 
goal numbers, and also hit the fire button. Furthermore, player 
names were displayed in the upper-left portion of the screen, and 
in the upper-right portion of the screen were indicators of game 
performance, which included their current score, level, and number 
of rings. In the competitive condition, players had their own game 
interfaces, which were placed side-by-side (see Figure 1B). Each 
interface was identical to the individual play condition; however, 
indicators of each player’s game performance were present and 
visible to both players. Furthermore, both players could work at 
their own pace, independently advancing through the levels. 


Measures 


Within-game learning and performance measures. Two in- 
dicators of game performance were used. Within-game learning 
was assessed with the total number of problems solved during the 
posttest individual game play period. During this game period, 
players were presented with problems on a similar level of diffi- 
culty as during the pretest and experimental sessions. The number 
of problems they solved, and the challenge level they reached, 
depended on how fast they progressed in the game. Increased 
performance during the posttest should suggest that arithmetic 
learning had occurred during the experimental session. The other 
indicator was the total number of “Game Over” messages players 
received during the experimental game play. When a player ran out 
of rings, he or she received a message stating, “You ran out of 
rings. The FactorReactor was destroyed,” and then players re- 
started the level, which contained the same problems. Players ran 
out of rings either because they failed to solve any problems 
correctly, thereby failing to earn rings, or because they were not 
efficient enough in solving the problems. Therefore, the number of 
times a player received this “Game Over” message was also 
considered indicative of game performance. Pretest performance 
on each indicator was also collected and used as a covariate in the 
analyses. 

Achievement goal orientations. Participants were given the 
Achievement Goal Orientation subscale from the Patterns of 
Adaptive Learning Scales (Midgley et al., 2000). The language in 
the scale was simplified to ensure comprehension in our middle- 
school sample and was adapted to be relevant for game play. The 
14-item survey asked students to indicate their level of agreement 
using a 7-point scale (1 = Very much disagree, 4 = Neither agree 
nor disagree, 7 = Very much agree) in response to items such as 
“One of my goals was to learn as much as I could about the game” 
(mastery; α = .87), “One of my goals was to show others that the 
game was easy for me” (performance-approach; α = .84), and “It 
was important to me that my performance on the game didn’t make 
me look stupid” (performance-avoidance; α = .70). 
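
The reliability coefficients reported for each subscale are Cronbach's alpha values. As a minimal sketch of how such a coefficient is computed from item-level ratings, the hypothetical function `cronbach_alpha` below applies the standard formula; the ratings in the example are made up for illustration and are not data from the study.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-participant item ratings.

    responses: list of lists, one inner list per participant,
    holding that participant's rating on each item of the subscale.
    """
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    items = list(zip(*responses))  # regroup ratings by item
    item_variance_sum = sum(pvariance(item) for item in items)
    total_variance = pvariance([sum(row) for row in responses])
    # alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up ratings from three participants on a two-item scale.
example = [[1, 2], [2, 1], [3, 3]]
print(round(cronbach_alpha(example), 2))
```

The population variance is used for both the item variances and the total-score variance; using the sample variance for both gives the same ratio, so the result is unchanged.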

Situational interest. Situational interest was measured using 
an adaptation of the Situational Interest Survey (Linnenbrink- 
Garcia et al., 2010). The language of the survey was simplified to 
ensure comprehension in our middle-school sample and was 
adapted to be relevant for game play. The survey assessed several 
aspects of situational interest, including affective responses to the 
game (e.g., excitement, fascination) and its personal importance. 
Participants used a 7-point scale anchored at 1 (Very much disagree), 
4 (Neither agree nor disagree), and 7 (Very much agree) to 
indicate their level of agreement with 12 statements, such as “The 
game was exciting,” “I learned valuable things from the game,” 
and “What I learned from the game is fascinating to me” (α = .92). 

Game enjoyment. Overall enjoyment of the game was as- 
sessed with two questions asking participants to rate the extent to 
which they had fun playing the game and how much they liked the 
game, on a 5-point scale anchored at 1 (Not at all) and 5 (A lot) 
(α = .80). 

Future game intentions. Two items assessed participants’ 
future intentions regarding FactorReactor. The first assessed in- 
tentions to reengage in the game, asking “Would you play this 
game again in the future?” The other assessed their intention to 
recommend the game to someone else, asking “Would you rec- 
ommend it to your friends/teachers?” Both items were assessed on 
a 5-point scale ranging from 1 (Not at all) to 5 (Definitely). 

Prior experience with video game controllers. Participants 
were asked to indicate their level of experience with video game 
controllers like the ones used in the study, rated on a 5-point scale 
anchored at 1 (None) and 5 (A lot). This variable was used as a 
covariate on the game performance analyses (see Table 1). 

Out-of-game learning measure. Participants were given a 
pre- and posttest of math fluency as an out-of-game assessment of 
arithmetic learning. The measure included 160 simple arithmetic 
problems for which participants were given 3 min to complete as 
many problems as possible. This measure of math fluency was 
adapted from the Woodcock–Johnson III Math Fluency subtest 
(McGrew & Woodcock, 2001), modified by randomizing the 
presentation of problems and by including simple division prob- 
lems as well as addition, subtraction, and multiplication problems. 
The posttest of math fluency was identical to the pretest, though 
the problems were presented in a different, randomized order to 
diminish practice effects. 


Results 


The data were analyzed using hierarchical linear models 
(HLMs). In these models, individuals were nested within pairs for 
the sole purpose of accounting for the correlated variance between 
individuals playing in dyads, which was the case for two of the three 
experimental conditions (i.e., competitive and collaborative 
play). The main intention of our analyses, however, was to draw 
conclusions at the level of the individual, not the pairs level, so our 
report chiefly focuses on individual-level effects. 

Across all analyses, mode of play was dummy coded with 
competitive play and collaborative play entered into the models, 
and individual play as the reference group. All analyses were run 
using HLM Version 7 (Raudenbush, Bryk, & Congdon, 2011). No 
gender or grade-level differences were found for the dependent 
variables; therefore, gender and grade level are not considered in 
further analyses. 
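
The dummy coding described above can be illustrated with a short Python sketch. The function names and the coefficient values in the example are hypothetical; the sketch only shows how, with individual play as the reference group, the intercept corresponds to the individual play condition and each dummy coefficient corresponds to a difference from it.

```python
MODES = ("individual", "competitive", "collaborative")


def dummy_code(mode):
    """Return (competition, collaboration) indicators for one participant.

    Individual play is the reference group, so it is coded (0, 0).
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode of play: {mode}")
    return (int(mode == "competitive"), int(mode == "collaborative"))


def predicted_mean(mode, intercept, b_competition, b_collaboration):
    """Fixed-effects prediction for a condition under this coding."""
    competition, collaboration = dummy_code(mode)
    return intercept + b_competition * competition + b_collaboration * collaboration


# Hypothetical coefficients, chosen only to illustrate the arithmetic.
for mode in MODES:
    print(mode, dummy_code(mode), predicted_mean(mode, 4.0, 0.5, 0.25))
```

Under this coding, a significant coefficient for a dummy variable indicates a difference from individual play, which is why comparisons between the competitive and collaborative conditions require separate follow-up analyses.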


Game Performance 


Two indicators of game performance were analyzed. In our first 
analysis, we examined the effect of mode of play (individual vs. 
competitive vs. collaborative) on the number of problems solved in 
the posttest of game play. We ran an HLM with number of 
problems solved as the dependent variable, the two mode-of-play 
condition dummy variables, and two covariates: The number of 
pretest play problems solved served as a baseline of game ability, 

Figure 2. Adjusted means for number of problems solved in the posttest 
game play by condition. Values atop each bar represent means (and 
standard deviations). P value reflects comparison with the individual play 
condition. ~ p = .05. 

The analysis yielded a statistically significant fixed main effect 
for collaborative play (p = .004), suggesting that players in that 
condition had a higher rate of inefficient problem-solving strategy 
use than those in the individual play condition. No such effect was 
found for the competitive condition in relation to the individual 
play condition (see Table 3 for parameter estimates and Figure 3 
for a graphical depiction). A follow-up analysis comparing the 
competitive and collaborative conditions suggested that there was 
no difference in the receipt of “Game Over” messages between the 
groups (p = .11). 

Furthermore, there was a significant random effect, χ²(31) = 
817.87, p < .001, suggesting that the number of “Game Over” 
messages received varied between pairs. The ICC (ICC = .96) 
suggested that 96% of the variance in “Game Over” messages 
received was attributable to variability between pairs, whereas 
only 4% was attributable to variability between individual players. 
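
The intraclass correlation reported above is the share of total variance that lies between pairs. A minimal sketch of that calculation follows; the function name `icc` and the variance components in the example are hypothetical and were chosen only so that the result matches the reported proportion.

```python
def icc(between_pairs_variance, within_pairs_variance):
    """Intraclass correlation: between-pair variance over total variance."""
    total = between_pairs_variance + within_pairs_variance
    if total <= 0:
        raise ValueError("total variance must be positive")
    return between_pairs_variance / total


# Hypothetical variance components that yield the reported ICC of .96.
value = icc(24.0, 1.0)
print(f"{value:.2f} between pairs, {1 - value:.2f} between individuals")
```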


Table 3 
Estimates for Game Performance: Number of “Game Over” 
Messages Received During the Experimental Trial 

Figure 3. Adjusted means for number of “Game Over” messages received 
in the experimental play session by condition. Values atop each bar 
represent means (and standard deviations). P value reflects comparison 
with the individual play condition. ** p = .01. 


Achievement Goal Orientations 


Our next set of analyses examined the effect mode of play had 
on participants’ adoption of achievement goal orientations during 
game play. In each of the three analyses, the achievement goal 
orientation score was entered as the dependent variable in an HLM 
along with mode-of-play condition dummy codes as predictors. 
The equations took the following forms: 

The analysis for mastery goal orientation scores yielded significant 
fixed main effects for both competitive (p = .01) and 
collaborative (p = .04) conditions, suggesting that both conditions 
invoked a stronger mastery goal orientation than did playing the 
game individually (see Figure 4 for a graphical depiction). A 
follow-up analysis, however, showed that mastery goal orientation 
strength did not differ between the competitive and collaborative 
groups (p = .28). Furthermore, there was no significant random 
effect, χ²(38) = 38.19, p = .46, suggesting that the strength of 
mastery goal orientations was not attributable to the variability 
between pairs. 

For the performance-approach ($M_{\text{ind}} = 3.99$, $SD_{\text{ind}} = 1.32$; 

Figure 4. Mean adoption of mastery goal orientation by condition. Values 
atop each bar represent means (and standard deviations). P value 
reflects comparison with the individual play condition. * p < .05. ** p < .01. 

significant effects were found. See Tables 4, 5, and 6 for parameter 
estimates for the three goal orientation models. 


Situational Interest 


Our next analysis examined the extent to which mode of play 
invoked situational interest in players during game play. Situa- 
tional interest scores were entered in an HLM as the dependent 
variable along with the two mode-of-play condition dummy vari- 
ables as predictors. The equations took the following forms: 


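Level 1: $Y_{ij}(\text{situational interest}) = \beta_{0j} + \beta_{1j}(\text{competition}) + \beta_{2j}(\text{collaboration}) + r_{ij}$. 


Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$; $\beta_{1j} = \gamma_{10}$; $\beta_{2j} = \gamma_{20}$. 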


The analysis yielded statistically significant fixed main effects 
for both competitive (p = .04) and collaborative play conditions 
(p = .001; see Table 7 for parameter estimates and Figure 5 for a 
graphical depiction). These results suggest that playing in either 
competition or collaboration with another player made the game 
more exciting and personally relevant, as measured by the Situa- 
tional Interest scale, than when playing it alone. A follow-up 
analysis comparing competitive and collaborative modes showed 
that they did not differ with respect to situational interest (p = .59). 
Furthermore, there was no significant random effect, χ²(40) = 44.35, p = 
.29, demonstrating that the variance in situational interest was not 
explained by the variability between pairs. 
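
The two-level specification for situational interest can also be read as a single combined equation, obtained by substituting the Level 2 equations into Level 1. The block below is a sketch in conventional multilevel notation; the variance symbols $\tau_{00}$ and $\sigma^2$ and the normality assumptions are standard conventions and are not stated in the text.

```latex
\begin{align}
Y_{ij} &= \gamma_{00} + \gamma_{10}(\text{competition}_{ij})
        + \gamma_{20}(\text{collaboration}_{ij}) + u_{0j} + r_{ij}, \\
u_{0j} &\sim N(0, \tau_{00}), \qquad r_{ij} \sim N(0, \sigma^2).
\end{align}
```

Read this way, $\gamma_{00}$ is the expected score in the individual play condition, and $\gamma_{10}$ and $\gamma_{20}$ are the differences between that condition and the competitive and collaborative conditions, respectively. The random-effect test concerns whether the between-pair variance $\tau_{00}$ differs from zero.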


Enjoyment 


We next examined the effect of mode of play on players’ 
enjoyment of the game. Enjoyment scores were entered into the 
HLM as the dependent variable along with the dummy variables 
reflecting the mode-of-play conditions as predictors. The equations 
took the following forms: 


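Level 1: $Y_{ij}(\text{enjoyment}) = \beta_{0j} + \beta_{1j}(\text{competition}) + \beta_{2j}(\text{collaboration}) + r_{ij}$. 


Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$; $\beta_{1j} = \gamma_{10}$; $\beta_{2j} = \gamma_{20}$. 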


The analysis yielded statistically significant fixed main effects 
of competitive (p = .03) and collaborative (p < .001) play on 
game enjoyment as compared with the individual play condition. 
These results suggest that playing the game alone was significantly 
less enjoyable than playing it either competitively or collaboratively 
(Table 8 lists the parameter estimates for the model, and 
Figure 6 provides a graphical depiction). A follow-up analysis 
comparing the competitive and collaborative groups demonstrated 
that they did not differ with regard to their enjoyment of the game 
(p = .38). 

Table 5 
Estimates for Performance-Approach Goal Orientation 

Fixed effects         Coefficient    SE 
Intercept             3.99"          32M 
Competitive play      61             5)! 
Collaborative play    04             poll 

Random effects        Variance 
Pairs intercept       58 

Finally, there was a significant random effect suggesting that 
enjoyment of the game varied between pairs, χ²(40) = 68.60, p = 
.003. The ICC (ICC = .36) suggested that 36% of the variance in 
game enjoyment was explained by the variability between pairs, 
whereas 64% was explained by the variability between individual 
players. 


Future Game Intentions 


Two indicators of players’ future intentions with regard to the 
game were examined. The first analysis examined the reported 
likelihood participants would play the game again. Reengagement 
intentions were entered into the hierarchical model as the depen- 
dent variable along with mode-of-play condition dummy codes. 
The equations took the following forms: 


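Level 1: $Y_{ij}(\text{reengagement intentions}) = \beta_{0j} + \beta_{1j}(\text{competition}) + \beta_{2j}(\text{collaboration}) + r_{ij}$.


Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$, $\beta_{1j} = \gamma_{10}$, $\beta_{2j} = \gamma_{20}$.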
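
The dummy codes that enter these models can be sketched as follows, assuming individual play is the reference condition (both dummies equal 0), which is how the results are interpreted in the text. The function names and the coefficient values in the usage lines are hypothetical.

```python
CONDITIONS = ("individual", "competitive", "collaborative")


def dummy_code(condition: str) -> dict:
    """Return the two dummy variables; individual play is the reference."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    return {
        "competition": int(condition == "competitive"),
        "collaboration": int(condition == "collaborative"),
    }


def predicted_mean(condition: str, gamma00: float, gamma10: float, gamma20: float) -> float:
    """Fixed-effects prediction: gamma00 + gamma10*competition + gamma20*collaboration."""
    d = dummy_code(condition)
    return gamma00 + gamma10 * d["competition"] + gamma20 * d["collaboration"]


# Hypothetical coefficients, for illustration only.
for cond in CONDITIONS:
    print(cond, dummy_code(cond), predicted_mean(cond, 3.0, 0.5, 0.2))
```

Under this coding, each fixed-effect coefficient is a contrast with individual play, which is why the comparison between the competitive and collaborative conditions requires a separate follow-up analysis.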


The analysis resulted in a statistically significant fixed main 
effect of the collaborative condition (p = .03), such that players in 
that condition reported a higher likelihood of playing the game 
again than those who played the game individually. No such effect 
was found for the competitive condition, however. Those partici- 
pants’ intentions to reengage in the game were no different than 
those in the individual play group. See Table 9 for the model 
parameter estimates and Figure 7 for a graphical depiction. A 
follow-up analysis further suggested that intentions to play the 
game again were no different for those in the competitive and 
collaborative play conditions (p = .57). Furthermore, there was no
significant random effect, χ²(39) = 36.93, p > .50, suggesting that
the variance in intentions to reengage in the game was not due to
variability between pairs.

Table 7
Estimates for Situational Interest

The second future intention examined was players’ intention to 
recommend the game to a friend or teacher. Therefore, recommen- 
dation intentions were added to the hierarchical model as a depen- 
dent variable along with the mode-of-play condition dummy vari- 
ables. The equations took the following forms: 


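Level 1: $Y_{ij}(\text{recommendation intentions}) = \beta_{0j} + \beta_{1j}(\text{competition}) + \beta_{2j}(\text{collaboration}) + r_{ij}$.


Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$, $\beta_{1j} = \gamma_{10}$, $\beta_{2j} = \gamma_{20}$.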


The analysis yielded a statistically significant fixed effect for the 
collaborative condition (p = .01), but not the competitive condi- 
tion. These results suggest that playing the game collaboratively 
led participants to report a stronger intention to recommend the 
game to someone else than those who played the game individu- 
ally (see Table 10 for parameter estimates and Figure 8 for a 
graphical depiction). A follow-up analysis comparing the compet- 
itive and collaborative conditions yielded a null result (p = .43), 
demonstrating that intentions to recommend the game did not 
differ between the two groups. Furthermore, there was no significant
random effect, χ²(39) = 40.24, p = .42, suggesting that
recommendation intentions did not vary between pairs.


Math Fluency 


Before investigating the effect of game play condition on math
fluency, we first examined whether there was an overall change
from pre- to posttest fluency scores (see Table 11). A paired t test
comparing pre- and posttest fluency scores was conducted; however,
one participant did not complete the posttest and was thus
omitted from the analysis. Results suggested that posttest fluency
scores (M = 70.42, SD = 26.67) were statistically significantly
higher than pretest scores (M = 66.86, SD = 26.42), t(56) =
−2.59, p = .01. Therefore, players increased their math fluency
from pre- to posttest.

Figure 5. Mean situational interest scores by condition. Values atop each
bar represent means (and standard deviations). P value reflects comparison
with the individual play condition. * p ≤ .05. ** p ≤ .01.
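
The paired t statistic used for the pre/post comparison can be sketched with the standard library alone. The function name and the scores below are hypothetical. Differences are taken as pretest minus posttest, so an increase from pre- to posttest gives a negative t, matching the sign reported in the text.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, df) for a paired t test on pre - post differences."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(pre, post)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    se = sd / math.sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical fluency scores for four players, for illustration only.
pre_scores = [10, 12, 9, 11]
post_scores = [12, 13, 11, 12]
t_stat, df = paired_t(pre_scores, post_scores)
print(f"t({df}) = {t_stat:.2f}")
```

A participant with a missing posttest has no difference score, which is why such a case drops out of the analysis and the degrees of freedom equal the number of complete pairs minus one.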

Next, we analyzed posttest math fluency scores to investigate 
the effect of condition. Dummy-coded modes of play were entered 
into the model along with pretest fluency scores as a covariate. The 
equations took the following forms: 


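Level 1: $Y_{ij}(\text{posttest fluency scores}) = \beta_{0j} + \beta_{1j}(\text{competition}) + \beta_{2j}(\text{collaboration}) + \beta_{3j}(\text{pretest fluency scores}) + r_{ij}$.


Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$, $\beta_{1j} = \gamma_{10}$, $\beta_{2j} = \gamma_{20}$, $\beta_{3j} = \gamma_{30}$.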


Although the effect of pretest fluency scores was found to be 
significant (p < .001), indicating a positive relation with posttest 
scores, the analysis indicated no effect of collaborative play or 
competitive play on posttest math fluency scores. The null result 
suggests that there were no differences in fluency scores between 
the individual (M = 65.63, SD = 15.21), competitive (M = 78.68, 
SD = 36.47), and collaborative (M = 66.77, SD = 22.30) game 
play conditions. The effect of the grouping variable, pairs, on 
posttest math fluency scores was found to be not statistically 
significant. This indicates that none of the variance in posttest 
math fluency scores is attributable to the pairings after accounting 
for variability from pretest fluency scores. Furthermore, there was
no significant random effect, χ²(39) = 33.01, p > .50, suggesting
that math fluency did not vary between pairs.

Figure 6. Mean game enjoyment scores by condition. Values atop each
bar represent means (and standard deviations). P value reflects comparison
with the individual play condition. * p < .05. *** p < .001.


Discussion 


The goal of the present research was to investigate the learning, 
performance, and motivational outcomes associated with playing 
an educational math game either competitively or collaboratively 
as compared with individually. With a few exceptions, our predic- 
tions were confirmed. 

Two analyses were conducted to assess the effect of mode of
play on within-game learning and performance. The first analysis 
examined the number of problems solved in the posttest, which 
showed that, in comparison to individual play, performance was 
better for competitive, but not collaborative play. Playing compet- 
itively may have aided in the development of arithmetic skills such 
that players were able to solve more problems during the within- 
game posttest. Our second analysis examined the efficiency of 
problem-solving strategies used by learners during the experimen- 
tal session, operationalized as the number of “Game Over” screens 
received by the player, which found that collaborative play re- 
sulted in worse performance than individual play. There was no 
difference, however, in performance between competitive and 
individual modes of play. 

There may be different explanations for these results. One
possibility is that our findings may be specific to the game used in
the present study. Indeed, collaboration has been shown to be
beneficial for motivation and learning under numerous circumstances
(e.g., Deutsch & Krauss, 1960; Hänze & Berger, 2007;
Nichols, 1996; Nichols & Miller, 1994; Sharan & Shaulov, 1990;
Slavin, 1988). FactorReactor, however, requires players in the
collaborative mode to communicate with each other, negotiating
which strategy to select and who will execute which move. For the
automation of arithmetic fluency, these particular tasks may be
best suited for modes in which players are in sole control of their
game space, as they were in the individual and competitive modes
of the present study. This is in line with findings by Mullins et al.
(2011), who found that mutual elaborations and explanations
were beneficial for conceptual knowledge, but not for skill
development. Another possible explanation is that the relatively short
game play penalized players for their collaborative meaning-making
and exploration, which was reflected in fewer problems
solved, and more inefficient strategies explored, than in individual
play. In a longer game play, this initial exploration may have
eventually resulted in better performance than individual or
competitive play, which should be investigated in future research.

Figure 7. Mean intent to play game again by condition. Values atop each
bar represent means (and standard deviations). P value reflects comparison
with the individual play condition.

Table 10
Estimates for Future Game Intentions: Recommendation

It should also be noted that it is uncertain whether within-game 
learning occurred because players in the competitive condition had 
improved their math fluency or whether there were other expla- 
nations. For example, competitive players may have increased 
their fluency of the game mechanics relative to those in other 
modes of play, or honed their strategies more effectively. In other 
words, it is possible that they improved their game-playing skills 
rather than their arithmetic skills. Future research will need to 
investigate these possible sources of increased fluency. 

Another set of analyses examined out-of-game learning, which
was assessed using timed paper-and-pencil tests before and after
participants played the game. It was found that players’ math
fluency scores had improved overall. Without additional data,
however, we cannot necessarily claim that the increase in scores
was due to playing the game rather than reflecting a test-taking
effect. Contrary to our predictions, a second analysis showed that
posttest scores did not vary by condition. Although it is possible
that the skills acquired during game play do not transfer to
out-of-game measures of learning, the relatively short duration of
game play may not have been sufficient for the transfer to occur.

Figure 8. Mean intent to recommend game by condition. Values atop
each bar represent means (and standard deviations). P value reflects
comparison with the individual play condition. ** p ≤ .01.

A series of analyses were also conducted to examine the effect 
of mode of play on multiple indicators of motivation. In compar- 
ison to individual play, competitive and collaborative play resulted 
in the strongest mastery goal orientation, which is associated with 
highly adaptive patterns of motivation and learning (Ames, 1992; 
Midgley et al., 2001). This finding suggests that these modes of 
play may impact students’ learning-related goals to focus more on 
learning the subject matter, improving, and finding the most opti- 
mal strategies, and less on normative comparisons with other 
students or validating their abilities. This notion is further sup- 
ported by the fact that we found that the competitive and collab- 
orative modes of play did not differ from individual play in their 
invocation of performance-approach and performance-avoidance 
goal orientations. This null finding may have stemmed from the 
way in which participants experienced the competitive mode. 
Although performance goals are concerned with outperforming 
others, it is in the service of demonstrating normative ability (e.g., 
Grant & Dweck, 2003; Urdan & Mestas, 2006). Indeed, perfor- 
mance goals and competition are different constructs. Playing in 
competition with another student may not be sufficient to invoke 
concerns about normative ability. If a student were to play against 
all of his or her classmates and their scores were made available to 
each other, however, a concern for normative performance may be 
elicited, along with a performance goal. In other words, although 
competition may play a role in the invocation of performance 
goals, such that there exists a desire to outperform others, our data 
suggest that the way in which competition was operationalized in 
FactorReactor was not sufficient to invoke performance goal 
orientations. Given the properties of the game we used, our results 
suggest that in the context of a learning game, competition with 
only one other player, rather than all other classmates, may be an 
effective means of invoking a mastery goal orientation without the
negative outcomes associated with the invocation of performance 
goal orientations. 

Our results also demonstrated that competitive and collaborative 
play increased situational interest and game enjoyment in relation 
to individual play. That these constructs were augmented has 
particularly important implications for the use of these modes of 
play in educational games. First, students are more likely to engage
in a task they perceive to be enjoyable (Salen & Zimmerman,
2003), thereby increasing their exposure to the educational content.
Second, the invocation of situational interest suggests that the
effect of the game reaches beyond mere enjoyment. Relative to the
individual play condition, players in the competitive and
collaborative conditions experienced the game as personally involving
and its content as valuable and personally relevant. This increase
in situational interest lays a foundation on which a more
internalized and enduring interest, individual interest, is built.

Additionally, collaborative play increased participants’ inten- 
tion to play the game again and to recommend the game to another. 
This supports the notion that games not only engage students in
particular learning activities and content but also increase the
likelihood of reengagement over time, in and out of the classroom
(Gee, 2007; Squire, 2003). It also suggests that they may foster the
development of a more internalized individual interest that intrin- 
sically guides students’ future learning endeavors, both alone and 
assisted by an instructor (Bergin, 1999; Deci, Vallerand, Pelletier, 
& Ryan, 1991). 

It should be noted that our indicators of motivation were gen- 
erally assessed in terms of the game itself. Therefore, it is possible 
that our results reflect motivational responses to the game rather 
than arithmetic. The intention of educational games, however, is to 
provide a context that engages learners and motivates them to 
reengage over time. Furthermore, the effectiveness of these games 
is attributable, in part, to their ability to reengage learners. For 
example, a student who enjoys a math game may play it fre- 
quently, resulting in increased exposure to and practice with math- 
ematical operations. Even so, a number of the items used to assess 
motivation referred specifically to the learning content of the 
game, as with our assessment of situational interest, which like- 
wise resulted in our predicted effects. 

Taken together, our results suggest that there are benefits and 
costs associated with particular modes of play. Although the
competitive and collaborative modes elicited the strongest motivation
and interest and increased the degree to which mastery goal
orientations were adopted, the collaborative condition also resulted
in the highest frequency of inefficient strategy use, even as it led to
more positive attitudes toward the game. More specifically, participants
in the collaborative condition had to restart the most levels, sug- 
gesting that their collaborations were inefficient and error-prone, 
and led to the use of poor strategies as compared to those in the 
individual mode. Yet, collaborative play also led to greater inten- 
tions to play the game again, suggesting that, over time, this 
negative effect could be resolved. 


Limitations and Future Research 


As is the case for all empirical studies, there are some limita- 
tions to the generalizability of our findings. Most importantly, the 
results of this study cannot be readily generalized to all educational 
games. The game used in this research, FactorReactor, was de- 
signed for the practice and automation of arithmetic skills to 
increase fluency in middle-school students. There are many genres 
of games with features that differ significantly from this game, 
such as role-playing games, adventure games, augmented reality 
games, or first-person shooters. Because of the corresponding 
design differences, a different effect of mode of play might be
expected for other game genres. For example, we found collaborative
play to increase the rate of adoption of inefficient and
error-prone strategies during the game. We would not, however, 
suggest that collaboration is detrimental to performance in general. 
The characteristics of this particular game may not have been ideal
for collaborative play within a 15-min period, a problem that may have
been exacerbated by the fact that we chose, for the reasons outlined
above, to keep instruction on how to collaborate to a minimum.
Future research will need to investigate whether our findings can 
be replicated with games with similar objectives, but from other 
game genres. Future work should also examine the effects of other 
design factors, and should investigate whether the effects found in 
the present study are different for games that cover different kinds 
of knowledge, for example, whether collaborative game play 
would result in better learning of conceptual knowledge, as sug- 
gested by Mullins et al. (2011). We are also interested in conduct- 
ing further research to explore the learning processes in individual 
versus collaborative and competitive game modes by collecting 
process data such as biometrics and eye tracking (Aleven, Rau, & 
Rummel, 2012). 


Conclusion 


The results of this study, which provide initial evidence for the 
effect of social context in game-based development of arithmetic 
fluency, have important theoretical and practical implications. 

On the theoretical side, we demonstrated that although only the 
competitive mode of play increased within-game learning, both 
competitive and collaborative modes of play increased situational 
interest, enjoyment, and the adoption of a mastery goal orientation, 
compared with individual play. These results are in line with 
previous research in computer-supported learning of mathematics 
that showed that benefits of collaboration were only found for 
conceptual knowledge, but not found for skills acquisition (Mul- 
lins et al., 2011). Our research extended these findings by also 
considering the impact of a form of competition that has the 
benefits of increased performance while still invoking a mastery 
goal orientation rather than a performance goal orientation. It is 
especially interesting that although resulting in inefficient use of 
problem-solving strategies and error-prone game play, collabora- 
tive play was associated with greater enjoyment, situational inter- 
est, and intention of reengagement than individual play. These 
results fit within a framework of learning with media that recog- 
nizes the importance of social context and related affective vari- 
ables in addition to cognitive ones (Moreno & Mayer, 2007). 

On the practical side, this research provides empirical support 
for the potential of educational games as effective learning envi- 
ronments that provide incentives for students to play repeatedly 
over time. Our results demonstrate that game designers need to 
earnestly consider the differential effects of competitive and col- 
laborative modes of a game in skill fluency development. Al- 
though both modes of social play increase situational interest and 
future intentions to play, only the competitive mode resulted in 
increases in game performance compared with individual play, 
whereas collaborative play resulted in the adoption of less efficient 
problem-solving strategies. This research also highlights that many 
of the outcomes of learning with gamelike environments are of an 
affective nature and that such affective outcomes of motivation
and interest have to be considered in addition to the cognitive
learning outcomes of a game. 

In summary, the research reported in this study provides empirical
support for a social context design pattern that emphasizes
competitive modes of play over collaborative and individual play
for games aimed at developing arithmetic skill fluency and fostering
a mastery goal orientation, as well as increasing situational
interest and enjoyment.


Guiding Learners Through Technology-Based Instruction:
The Effects of Adaptive Guidance Design and Individual Differences on
Learning Over Time


Adam M. Kanar 
Brock University 


Bradford S. Bell
Cornell University


Adaptive guidance is an instructional intervention that helps learners to make use of the control inherent 
in technology-based instruction. The present research investigated the interactive effects of guidance 
design (i.e., framing of guidance information) and individual differences (i.e., pretraining motivation and 
ability) on learning basic and strategic task skills over time. One hundred thirty participants were 
randomly assigned to 1 of 2 types of adaptive guidance (autonomy supportive, controlling) or a 
no-guidance condition while learning to perform a complex simulation task over 9 consecutive trials. 
Results indicated that participants receiving controlling guidance acquired strategic task skills at a faster 
rate than participants receiving autonomy-supportive guidance or no guidance. The design of adaptive 
guidance also moderated the effects of pretraining motivation and cognitive ability on learners’ acqui- 
sition of basic and strategic task skills. Specifically, autonomy-supportive guidance enhanced the positive 
effects of pretraining motivation on the acquisition of basic task skills, and controlling guidance enhanced 
the positive effects of cognitive ability on the acquisition of strategic task skills. Implications for research
and practice are discussed.


Keywords: learning, technology, guidance, individual differences, performance 


Over the past decade, a number of different forces, including 
technological advances, economic pressures, and globalization, 
have spurred significant growth in technology-based instruction in 
both higher education and corporate settings. For instance, the 
National Center for Education Statistics estimates that from 2000 
to 2008, the percentage of undergraduates enrolled in at least one 
distance education course grew from 8% to 20% (Radford, 2011). 
Similarly, the American Society for Training and Development 
estimates that the percentage of learning delivered through tech- 
nology in work organizations has increased from 8.8% in 2000 to 
38.5% in 2011 (Miller, 2012; Van Buren & Erskine, 2002). 

One important implication of this trend in learning delivery is 
that technology-based instruction often provides learners with sig- 
nificant control over different aspects (e.g., content, sequence, 
pace) of their learning (DeRouin, Fritzsche, & Salas, 2004). 
Kraiger and Jerden (2007), for example, noted that many modern 
forms of technology-based instruction follow a learner-centered 
format in which the software serves as a learning portal, and 
individuals must make choices about both what and how to learn.
When compared with conditions in which instructional software
controls most or all of the learning decisions (i.e., program control),
learner control often has a positive, albeit small, effect on
student outcomes (Kraiger & Jerden, 2007; Reeves, 1993). Yet, 
researchers have also noted that instruction that offers high levels 
of learner control often proves ineffective because learners expe- 
rience resource depletion, fail to come into contact with important 
information, and make poor learning decisions (Brown, 2001; 
Kirschner, Sweller, & Clark, 2006; Mayer, 2004). 

These findings highlight the need for instructional strategies that 
can assist learners in making effective use of the control offered by 
technology-based instruction. One approach that has been exam- 
ined involves supplementing learner control with adaptive guid- 
ance, which provides learners with diagnostic and interpretive 
information designed to help them make more effective learning 
decisions (Bell & Kozlowski, 2002). Although research has shown 
that adaptive guidance leads to better learning outcomes than 
either total learner or program control (e.g., Bell & Kozlowski, 
2002; Corbalan, Kester, & van Merriënboer, 2008), the issue of
how adaptive guidance should be designed to optimize student 
learning in technology-based instruction remains largely unex- 
plored. 

One instructional design feature that may have an important 
impact on student achievement is the framing of guidance infor- 
mation. Prior research has demonstrated that how learning instruc- 
tions and activities are framed can have a significant impact on 
learning (e.g., Kozlowski & Bell, 2006; Rawsthorne & Elliot, 
1999). For instance, drawing on self-determination theory (SDT), 
investigators have shown that learning contexts that are framed as 
autonomy supportive lead to higher levels of motivation and
learning than contexts that are framed as controlling (e.g., Black &
Deci, 2000; Vansteenkiste, Simons, Lens, Sheldon, & Deci, 2004). 
These findings suggest that guidance information should be 
framed so as to minimize perceptions of external control and 
emphasize learners’ autonomy and freedom. Resource allocation 
theories of self-regulation (e.g., Kanfer & Ackerman, 1989), how- 
ever, suggest that providing greater autonomy and choice may 
deplete learners’ cognitive resources and impede skill acquisition, 
particularly in learning contexts that impose substantial demands 
on attentional resources. Thus, guidance that is framed as more 
controlling and restrictive may reduce the burden on learners, 
allow them to direct more of their attentional resources to learning, 
and increase the likelihood that learners come into contact with
important learning content (Mayer, 2004). 

In the current study, we explore these different perspectives 
through an examination of the effects of two forms of adaptive 
guidance—autonomy supportive and controlling—on learning 
during a complex simulation-based training program. This effort 
advances the existing literature in at least three ways. First, using 
SDT and resource allocation theory, we propose that the effects of 
different adaptive guidance designs may vary across different 
learning outcomes. To test this prediction, we examine the effects 
of autonomy-supportive and controlling guidance on multiple in- 
dicators of learning, namely, the acquisition of basic and strategic 
task skills. Second, recent studies suggest that individual differ- 
ences often moderate the effects of interventions designed to 
improve learning during technology-based instruction, such that a 
specific intervention will be more effective for some learners than 
others (e.g., Sitzmann, Bell, Kraiger, & Kanar, 2009). Building on 
and extending these findings, we examine how different forms of
guidance interact with individual differences related to effort (i.e.,
pretraining motivation) and resource availability (i.e., cognitive
ability) to influence learning. Finally, we use a longitudinal design 
and latent growth modeling to examine the effects of the two forms 
of adaptive guidance over time. Whereas most research has treated 
the effects of guidance as static, our longitudinal approach exam- 
ines the impact of the different types of adaptive guidance on 
individuals’ learning trajectories over the course of instruction, 
which provides further insight into how different forms of guid- 
ance influence the acquisition of different types of task skills. The 
conceptual model examined in this research is presented in Figure 
1. In the following sections, we discuss the theory that underlies 
the relationships outlined in the model. 
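
The idea of a learning trajectory can be illustrated with a much simpler stand-in than the latent growth models used in this research: an ordinary least-squares slope fitted to one learner's scores across trials. The sketch below is an illustration only. The function name and the scores are hypothetical, and a per-learner OLS slope is not latent growth modeling, which estimates intercepts and slopes as latent variables across all learners at once.

```python
def trial_slope(scores):
    """OLS slope of performance on trial number (trials numbered 1..n)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two trials")
    trials = range(1, n + 1)
    mean_t = sum(trials) / n
    mean_y = sum(scores) / n
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(trials, scores))
    sxx = sum((t - mean_t) ** 2 for t in trials)
    return sxy / sxx


# Hypothetical scores for one learner over 9 consecutive trials.
learner = [10, 12, 15, 15, 18, 21, 22, 24, 27]
rate = trial_slope(learner)
print(f"average gain per trial: {rate:.2f}")
```

A static analysis would compare only final scores; a trajectory view asks whether this rate of change differs by guidance condition.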


Adaptive Guidance 


Although there is some evidence that learner control can en- 
hance student motivation and satisfaction (e.g., Reeves, 1993), 
research suggests that individuals often do not make effective use 
of the control they are given over their instruction (Steinberg, 
1977, 1989). Learners frequently misinterpret feedback and are 
poor judges of their performance and progress, which can lead to 
poor learning choices and misdirected effort. Brown (2001), for 
example, studied learner choices during online instruction and 
found that learners commonly skipped critical practice opportuni- 
ties, and some spent less than 50% of the available time in the 
course. He concluded, “Results suggest that, despite the appeal of 
computer-based training as a way to make learning more efficient, 
employees may not use control over their learning wisely” (p. 
290). Mayer (2004) leveled similar criticisms against discovery
learning, in which students are free to work in the learning
environment with little or no guidance. He reviewed research that
compared pure and guided discovery methods and concluded that
guided methods help ensure that students come into contact with
to-be-learned material and better support the cognitive processes
necessary for constructivist learning. Finally, Kirschner et al.
(2006) argued that unguided environments create a heavy working
memory load that is detrimental to learning.

Figure 1. Conceptual model of predicted relationships between adaptive-guidance design,
individual-difference factors, and performance trajectories. H1, H2, H3, H4, and H5 =
Hypothesis 1, Hypothesis 2, Hypothesis 3, Hypothesis 4, and Hypothesis 5.

Although guided instruction can take many forms (cf. Kirschner 
et al., 2006), in the current study we focus on adaptive guidance, 
which was designed for more complex learning environments that 
leverage technology (Bell & Kozlowski, 2002). Adaptive guidance 
was developed on the basis of a foundation provided by learner 
control research (e.g., Tennyson, 1980; Tennyson & Buttrey, 
1980), but was also designed to extend to more complex learning 
domains that require learners to acquire not only basic but also 
strategic task skills. Basic task skills involve a trainee’s ability to
perform fundamental task operations that must be learned in order 
to develop more advanced skills (Bell & Kozlowski, 2002). Indi- 
viduals use their declarative knowledge (e.g., knowledge of facts) 
and procedural knowledge (e.g., knowledge of rules) when per- 
forming basic skills. Through practice and experience, declarative 
knowledge is compiled or proceduralized, which allows trainees to 
execute basic operations more quickly and with fewer errors 
(Anderson, 1983). Strategic skills involve carrying out more dif- 
ficult operations that require trainees to understand the underlying 
complexities of a task and integrate task concepts. In addition, 
trainees must develop contextual knowledge that informs why, 
when, and where to apply their strategic skills (Ford & Kraiger, 
1995). Thus, strategic performance involves selectively retrieving 
and integrating specific knowledge from one’s knowledge base 
and applying the resulting constructions to varying task contin- 
gencies (Tennyson & Breuer, 2002). In environments that re- 
quire both basic and strategic skills, learning is a function of not 
only effort (e.g., time on task) but also the quality of study and 
practice activities. Thus, adaptive guidance uses learners’ past 
performance to provide evaluative and diagnostic information 
that assists them in judging their progress toward task mastery, 
which should influence the amount of effort they invest in 
learning. In addition, it provides individualized suggestions for 
what learners should study and practice, based on progress, 
which should influence the allocation of attention and lead to 
better learning choices. 
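
The two functions of adaptive guidance described above, an evaluation of progress toward mastery and individualized suggestions for what to study next, can be sketched as a simple rule. This is a hypothetical illustration, not the guidance system used by Bell and Kozlowski (2002) or in the present study; the mastery threshold, topic names, and message wording are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    evaluation: str    # how the learner stands relative to mastery
    suggestions: list  # topics to study or practice next


def adaptive_guidance(scores, mastery=0.8, max_suggestions=2):
    """Build guidance from past performance.

    scores maps a topic name to the proportion correct (0..1) on recent trials.
    """
    if not scores:
        raise ValueError("no performance data")
    # Topics below the assumed mastery threshold, weakest first.
    weak = sorted((s, t) for t, s in scores.items() if s < mastery)
    if not weak:
        return Guidance("All topics are at or above the mastery level.", [])
    topics = [t for _, t in weak[:max_suggestions]]
    evaluation = f"{len(weak)} of {len(scores)} topics are below the mastery level."
    return Guidance(evaluation, topics)


# Hypothetical performance record for one learner.
record = {"basic_a": 0.9, "basic_b": 0.5, "strategic_a": 0.7, "strategic_b": 0.2}
guidance = adaptive_guidance(record)
print(guidance.evaluation)
print(guidance.suggestions)
```

The learner remains free to ignore the suggestions, which is what distinguishes guidance of this kind from program control.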

Bell and Kozlowski (2002) showed that adaptive guidance helps 
learners to make better learning decisions in a learner-control 
environment. Learners who received guidance studied and prac- 
ticed training material in a more appropriate sequence than those 
who received no guidance. Guidance also had a positive effect on 
trainees’ self-efficacy early in training, when learning is most 
challenging and errors are common. The result was that learners 
who received adaptive guidance exhibited higher levels of basic 
and strategic knowledge and performance and were better able to 
transfer their skills than those who were given learner control 
without guidance (Bell & Kozlowski, 2002). Accordingly, we 
expect that learners receiving adaptive guidance will exhibit 
greater positive change in their performance relative to those in a 
no-guidance condition. 

Hypothesis 1: Participants who receive adaptive guidance will
exhibit more positive change in basic and strategic perfor- 
mance skills than participants who do not receive guidance. 


Autonomy-Supportive and Controlling Guidance 


The issue of how adaptive guidance should be designed to 
optimize student learning in technology-based instruction has re- 
ceived limited research attention. Adaptive guidance seeks to 
provide the direction learners need to avoid making poor learning 
decisions while retaining the motivational benefits of autonomy. 
SDT (for a review, see Ryan & Deci, 2000) is a theory of 
motivation that assumes high-quality motivation is inherently hu- 
man and is expressed to different degrees depending on the context 
that influences the process of making choices. Initial conceptual- 
izations of motivation quality distinguished between motivations 
stemming from an internal locus of causality (e.g., interest and 
enjoyment) and those stemming from an external locus of causality 
(e.g., rewards and punishments) (Vansteenkiste, Lens, & Deci, 
2006). A more recent conceptualization, however, distinguished 
among various types of extrinsic motivation that differ in their 
degree of autonomy, which shifted the focus to differences be- 
tween autonomous motivation, which involves the experience of 
volition and choice, and controlled motivation, which involves the 
experience of being pressured or coerced (Vansteenkiste et al., 
2006). Prior research has shown that learning contexts that provide 
choice and options for self-direction tend to facilitate autonomous 
motivation and enhance learning, whereas controlling environ- 
ments that pressure learners to think or act in a particular way often 
diminish autonomous motivation and lead to poorer learning 
(Ryan & Deci, 2000; Vansteenkiste et al., 2004). 

A common means of operationalizing autonomy-supportive and 
controlling learning environments is through the framing of in- 
structions. For example, a number of laboratory and field studies 
have revealed that verbal or written instructions containing pri- 
marily autonomy-supportive phrases (e.g., “you may” or “if you 
choose”) lead to higher levels of autonomous motivation and 
learning than instructions with more controlling phrases (e.g., “you 
should” or “you have to”; Vansteenkiste et al., 2004). Thus, 
presenting adaptive guidance instructions using autonomy- 
supportive language may capitalize on these motivational benefits 
and lead to greater learning performance than guidance instruc- 
tions incorporating controlling language. 

As previously noted, however, prior learner control research has 
revealed that greater autonomy does not always translate into 
higher levels of learning, and in fact sometimes leads to poorer 
performance (e.g., Pollock & Sullivan, 1990). A closer examina- 
tion of this research suggests that these mixed findings may be due, 
at least in part, to differences in the learning outcomes examined 
across studies. For example, a meta-analysis by Patall, Cooper, and 
Robinson (2008) revealed small and positive effects of choice on 
simple task performance (i.e., quantity and accuracy), but they did 
not find a significant relationship between choice and subsequent 
measures of learning that assessed skill acquisition. Overall, they 
concluded that research examining the effects of choice on learn- 
ing has yielded findings that have been “somewhat inconsistent” 
(Patall et al., 2008, p. 294). Accordingly, it may be important to 
consider how different forms of guidance potentially impact dif- 
ferent types of learning outcomes. Ackerman (1987), for instance, 
found that motivation and effort are the primary determinants of
learners’ acquisition of declarative knowledge and performance on
simple tasks. Learners’ motivation influences performance on basic
tasks because, through practice and experience, learners develop
knowledge of facts (declarative knowledge) and rules (procedural
knowledge) and thus are able to perform tasks more quickly and
with fewer errors (Anderson, 1983). Thus, the motivational bene- 
fits of autonomy-supportive guidance should be evident on basic 
task components, where performance is determined primarily by 
effort (Bell & Kozlowski, 2002). Accordingly, we expect that 
learners receiving autonomy-supportive guidance will acquire ba- 
sic skills at a faster rate than learners receiving controlling guid- 
ance. 


Hypothesis 2: Participants in the autonomy-supportive condi- 
tion will exhibit more positive change in basic performance 
skills than participants in the controlling condition. 


The positive effects of choice in learner-controlled training, 
however, may not extend to learning outcomes that are a function 
of a trainee’s ability to process and integrate complex information. 
Acquisition of more complex task skills is closely tied to processes 
related to learners’ attention, such as choices made during training 
(e.g., sequence of study) and the quality of practice (Bell & 
Kozlowski, 2002; Brown, 2001), and guidance that is more con- 
trolling may increase the likelihood that trainees engage in appro- 
priate study and practice activities. In addition, guidance design 
features that facilitate (rather than restrict) a learner’s sense of 
autonomy increase the number of potential problem solutions and 
amount of information that needs to be processed. As the total amount
of information increases, people must rely on less information to 
make choices, resulting in simplified problem-solving and 
decision-making processes and suboptimal outcomes (Chua & 
Iyengar, 2008; Payne, Bettman, & Johnson, 1993). For example, 
Iyengar and Lepper (2000) found that a greater number of options 
decreased people’s ability to think about multiple solution combi- 
nations. By directing learners’ attention to key elements of the task 
and limiting learners’ choices, controlling guidance may enhance 
the acquisition and integration of skills for performing more com- 
plex components of the task. Thus, we expect that learners receiv- 
ing controlling guidance will acquire strategic skills at a faster rate 
than learners receiving autonomy-supportive guidance. 


Hypothesis 3: Participants in the controlling condition will 
exhibit more positive change in strategic performance skills 
than participants in the autonomy-supportive condition. 


Interactive Effects of Guidance Design and 
Individual Differences 


Although autonomy may yield motivational benefits during 
training, it is also important to consider trainees’ motivation when 
entering a training program (i.e., pretraining motivation). Pretrain- 
ing motivation describes trainees’ initial attitudes and intentions to 
exert effort toward learning the content of a training program (Noe, 
1986). Pretraining motivation is different from motivation quality 
constructs because pretraining motivation implies attitudes and 
personal action (activation) directed toward learning; motivation 
quality constructs address the beliefs and reasons underlying different
types of motivation (Vansteenkiste, Sierens, Soenens, Luyckx, & Lens,
2009). Motivated action theories have shown that attitudes and
intentions provide the link between beliefs and behaviors
(Heckhausen & Kuhl, 1985). Indeed, learning orientation
strongly and positively predicts trainees’ pretraining motivation 
levels (Colquitt & Simmering, 1998; Klein, Noe, & Wang, 2006), 
and motivation to learn has been shown, in turn, to positively relate 
to learning outcomes (Colquitt, LePine, & Noe, 2000). 

Although pretraining motivation has been shown to be a positive 
predictor of training outcomes, research has also revealed that 
individual characteristics often interact with training design to 
influence learning (i.e., aptitude-treatment interactions). Gully,
Payne, Koles, and Whiteman (2002), for example, found that 
trainees higher in openness to experience had, in general, higher 
declarative knowledge, training performance, and self-efficacy. In 
addition, they found that when the training was designed to en- 
courage exploratory behaviors consistent with this dispositional 
characteristic, the positive relationship was strengthened. How- 
ever, when the training was designed to restrict exploration, the 
positive effect of openness on the training outcomes was nullified. 
In the current study, we propose that guidance design may play a 
similar role in either enhancing or constraining the positive rela- 
tionship between pretraining motivation and skill acquisition. In 
particular, autonomy-supportive guidance should support trainees’ 
desire to take personal action toward learning the training content, 
thus strengthening the relationship between pretraining motivation 
and learning. However, guidance that is framed as controlling 
should contradict trainees’ positive attitudes and intentions toward 
the training, thus weakening the relationship between pretraining 
motivation and learning. Consistent with our earlier arguments, we 
expect the interaction between guidance design and trainees’ pre- 
training motivation will be observed for basic skill acquisition, 
which is determined primarily by trainees’ motivation and effort. 


Hypothesis 4: Pretraining motivation will be positively related 
to basic performance growth for participants receiving 
autonomy-supportive guidance, and this relationship will be 
weaker for participants receiving controlling guidance. 


In more complex learning environments, it is important to design 
training to support not only trainees’ motivation but also their cogni- 
tion (Bell & Kozlowski, 2008). Cognitive ability, which is an indi- 
vidual’s intellectual capacity, has been shown to be a potent predictor 
of learning (Colquitt et al., 2000; Ree & Earles, 1991). In general, 
individuals with higher levels of cognitive ability have greater atten- 
tional resources to devote to learning, which means they are able to 
absorb and retain more information than individuals with lower cog- 
nitive ability. The challenge in learner-controlled environments is 
ensuring that trainees allocate their attentional resources to study and 
practice activities that facilitate learning. DeRouin et al. (2004) sug- 
gested that when trainees are given too much control, “they may be 
unable to focus the majority of their attention on the subject matter of 
the instructional program” (p. 154), which can cause learning to 
suffer. Niederhauser, Reynolds, Salmen, and Skolmoski (2000), for 
example, examined the effects of hypertext navigation features on 
learning. They found that students who made extensive use of 
compare-and-contrast links, which were designed to provide alternate 
paths to information, exhibited impaired learning, whereas students 
who read the text in a systematic and sequential manner performed 
significantly better. Niederhauser et al. (2000) suggested that the 
compare-and-contrast links impeded learning because they required
learners to make decisions about what to read and the order in which 
to read information, which likely absorbed attentional resources that 
could no longer be directed to integrating new knowledge. Consistent 
with these findings, we expect that by providing learners with a clear 
and unambiguous path for navigating the training, controlling guid- 
ance should enable trainees to devote more of their attentional re- 
sources to learning. This should strengthen the positive relationship 
between cognitive ability and performance, particularly on strategic 
task components that require deeper comprehension and integration of 
task concepts. In contrast, the relationship between cognitive ability 
and strategic performance should be weakened when trainees are 
given autonomy-supportive guidance because the greater choice op- 
tions may increase the chances that attentional resources are misdi- 
rected or absorbed by instructional decisions. 


Hypothesis 5: Cognitive ability will be positively related to 
strategic performance growth for participants receiving con- 
trolling guidance, and this effect will be weaker for partici- 
pants receiving autonomy-supportive guidance. 


Method 


Participants 


Participants were 130 undergraduate students enrolled in an 
introductory human resource management course at a large north- 
eastern university who earned course credit for participation. Fifty- 
nine percent of the participants were male, and most (93.1%) were 
between 18 and 21 years old. 


Task 


The task used in this study was a version of TANDEM (Dwyer, 
Hall, Volpe, Cannon-Bowers, & Salas, 1992), a computer-based 
radar-tracking simulation designed for assessing judgment and deci- 
sion making in complex task environments. The object of the simu- 
lation was to make correct decisions about unknown—and potentially 
hostile—contacts appearing on a simulated radar screen and to pre- 
vent contacts from crossing defensive perimeters. Participants were 
required to detect, identify, and act on the multiple contacts on the 
screen using a number of basic and strategic skills (Bell & Kozlowski, 
2002; Kozlowski & Bell, 2006). All participants had access to an 
online instruction manual that contained complete information on all 
important aspects of the simulation. 

Basic skills involved making decisions about contacts on the 
radar screen. After engaging a contact, participants could access 
cue information from pull-down menus, with three cues available 
for each of three component decisions regarding the Type (air, 
surface, submarine), Class (civilian or military), and Intent (hostile 
or peaceful) of the contact. After making the three component 
decisions, participants needed to decide whether to take action 
against the contact (if hostile) or clear it from the radar screen (if 
peaceful). Participants received points for correct decisions and 
lost points for incorrect decisions. 

The basic skills serve as the foundation for developing more 
strategic skills focused on perimeter defense and contact prioriti- 
zation. Specifically, there are two defensive perimeters located 
within the task, and participants lose points for perimeter intrusions.
The inner defensive perimeter is clearly marked and easy for
participants to identify. However, the outer perimeter is beyond the 
initial viewing range of the radar display and is not clearly marked. 
Thus, participants must learn how to “zoom out” and locate
“marker contacts” that serve to identify the outer boundary. Par- 
ticipants must also learn how to prioritize contacts by determining 
which constitute the greatest threats to the defensive perimeters. 
There are often multiple contacts approaching both the inner and 
outer perimeter, so participants need to monitor both perimeters 
and gather information on the speed and distance of contacts in 
order to determine those that are the highest priority. Trainees also 
have to make strategic decisions about trade-offs between contacts 
approaching the inner and outer perimeters, based on the number 
of contacts at each perimeter and their “cost” if they penetrate. 


Manipulations 


Learners can be given control over a number of different aspects 
of their instruction, including content, sequence, and pace (Kraiger 
& Jerden, 2007). In the current study, all trainees were given 
control over what they chose to study and practice (content) and 
the order in which they chose to study and practice the material 
(sequence). In addition, they were given some control over the 
pace of their learning, such as being able to exit the online manual 
early; however, for design reasons, we set maximum time limits on 
the study and practice periods. Thus, trainees in all conditions were 
given the same level of objective learner control. 

At the beginning of the training session, participants in the 
no-guidance control condition were given a list of learning topics. 
They were told that the list covered all important aspects of the 
simulation and that they may want to focus on these topics during 
training, but what they chose to study and practice was at their 
discretion. Trainees in the no-guidance condition did not receive 
any guidance information. 

Trainees in the guidance conditions received the list of learning 
topics, along with guidance information that could be used to help 
them evaluate their current progress and improve their deficiencies 
in the different aspects of the simulation. As described below, the 
framing of this information depended on whether trainees were 
assigned to the controlling or autonomy-supportive condition. The 
guidance information was delivered following the last screen of 
feedback presented after each trial. The guidance manipulations 
created for the current study were modeled on prior research
(Bell & Kozlowski, 2002). The guidance was “adaptive” because 
the suggestions for study and practice were tailored to participants’ 
proficiency in the simulation.¹ The guidance focused on helping
learners build basic skills early in training, before proceeding later
in training to developing more strategic competencies that build on
the fundamental skills.


¹ The guidance was adaptive based on three levels of performance. Pilot
data were used to set cutoff scores at the 50th and 85th percentiles to
differentiate among low, medium, and high performance on different task
components. Learners were not aware of the cutoff scores. If individuals
scored below the 50th percentile, the guidance informed them that they had
not yet learned how to perform the necessary skill or strategy and provided
practice and study suggestions for improvement. For those scoring between
the 50th and 85th percentiles, the guidance informed them that they had
reached a level of minimal performance but needed to become more
proficient. The guidance also provided suggestions on what they should
study and practice to improve. For individuals exceeding the 85th
percentile, the guidance informed them that they had mastered the skill or
strategy and should focus on other areas in which they were still deficient.
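
The three-level rule described in the guidance footnote can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the message templates, and the handling of scores that fall exactly on a cutoff are assumptions made for illustration; only the 50th/85th percentile cutoffs and the modal phrases ("you must," "you may want to") come from the study description.

```python
# Modal phrases taken from the study's description of each framing;
# everything else in the message wording is a hypothetical template.
LEAD_IN = {
    "controlling": "You must",
    "autonomy_supportive": "You may want to",
}


def performance_level(score, p50_cutoff, p85_cutoff):
    """Classify a score as low, medium, or high against two cutoff scores.

    Assumption: a score exactly equal to a cutoff counts as "medium".
    """
    if score < p50_cutoff:
        return "low"      # below the 50th percentile cutoff
    if score <= p85_cutoff:
        return "medium"   # between the 50th and 85th percentile cutoffs
    return "high"         # exceeds the 85th percentile cutoff


def guidance_message(score, p50_cutoff, p85_cutoff, topic, framing):
    """Build one guidance message for a task component (illustrative wording)."""
    lead = LEAD_IN[framing]
    level = performance_level(score, p50_cutoff, p85_cutoff)
    if level == "low":
        return (f"You have not yet learned {topic}. "
                f"{lead} study and practice the material on {topic}.")
    if level == "medium":
        return (f"You have reached a minimal level of performance on {topic} "
                f"but need to become more proficient. "
                f"{lead} study and practice the material on {topic}.")
    return (f"You have mastered {topic}. "
            f"{lead} focus on other areas in which you are still deficient.")
```

In this sketch the two framings differ only in the lead-in phrase, which mirrors the statement that the two types of adaptive guidance were otherwise identical.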

The two guidance manipulations were created by framing the 
instructions for study and practice using language that either (a) 
was coercive and controlling (controlling guidance) or (b) empha- 
sized choice and self-initiated behaviors (autonomy-supportive 
guidance). The specific phrases were identical to those used in a 
number of earlier studies that manipulated autonomy-supportive or 
controlling contexts through task instructions (e.g., Vansteenkiste 
et al., 2004). Specifically, the controlling guidance manipulation 
used explicitly controlling language through phrases such as “you 
have to,” “you must,” “you should,” and “you had better.” For 
example, participants might be told, “You must study the material 
in your manual on prioritization strategies.” The
autonomy-supportive guidance manipulation used instruction phrases such as
“you can,” “you might,” “you may,” and “if you choose.” For 
example, participants in the autonomy-supportive guidance condi- 
tion might be told, “You may want to study the material in your 
manual on prioritization strategies.” Other than the differences in 
the use of autonomy-supportive or controlling phrases, the two 
types of adaptive guidance were identical. 



Measures 


Pretraining motivation. At the beginning of the experimental 
session, participants’ pretraining motivation was measured using 
seven items developed by Noe and Schmitt (1986).² Items were
modified to be consistent with our learning setting and were rated
on a 5-point scale ranging from 1 (strongly disagree) to 5 (strongly
agree). Sample items are “I am motivated to learn the skills 
emphasized in this training program” and “If I can’t understand 
something in the training program I will try harder.” Internal 
consistency reliability of the scale was .86. 

Cognitive ability. At the beginning of the experimental ses- 
sion, participants provided their SAT or ACT scores. Research has 
shown that the SAT and ACT have a large general cognitive ability 
component (Frey & Detterman, 2004). In addition, the publishers 
of these tests report high internal consistency reliabilities for their 
measures (e.g., KR-20 = .96 for the ACT composite score; Amer- 
ican College Testing Program, 1989) and self-reported SAT/ACT 
scores have been shown to correlate highly with actual scores. For 
example, Gully et al. (2002) found that self-reported SAT scores 
correlated .95 with actual scores. Individuals’ ACT or SAT scores 
were standardized using norms published by ACT and the College 
Board, and this standardized score was used as a measure of 
cognitive ability (College Board, 2011). 
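
As a rough illustration of the standardization step, the sketch below converts a self-reported SAT or ACT score to a z score so that both tests share one scale. The norm means and standard deviations are placeholders chosen for easy arithmetic, not the published ACT or College Board values, and the function name is hypothetical.

```python
# Placeholder norms (mean, standard deviation) for illustration only.
# These are NOT the published ACT or College Board norms.
HYPOTHETICAL_NORMS = {
    "SAT": (1000.0, 200.0),
    "ACT": (21.0, 5.0),
}


def cognitive_ability_z(test, score, norms=HYPOTHETICAL_NORMS):
    """Return the z score of a test score relative to that test's norms."""
    mean, sd = norms[test]
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (score - mean) / sd
```

With these placeholder norms, an SAT score of 1200 and an ACT score of 26 both map to a z score of 1.0, which is the sense in which scores from the two tests become comparable.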

Basic and strategic task performance. Using measures that 
have been established in previous research using the TANDEM 
simulation (e.g., Bell & Kozlowski, 2002), data were collected 
during each training trial that allowed assessments of participants’ 
performance on basic and strategic aspects of the task. Basic task 
performance was calculated on the basis of the number of correct 
and incorrect decisions during the trials, the two fundamental 
components of participants’ score. Performance on these two as- 
pects of the task is the result of knowledge of basic task compo- 
nents (e.g., decision-making cues and procedures). This measure is 
similar to task performance measures of accuracy often found in 
studies of choice effects on motivation (Patall et al., 2008). Stra- 
tegic task performance was composed of the number of times 
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participants zoomed out, the number of markers hooked in an 
effort to identify the location of an invisible outer perimeter, and 
the number of high-priority contacts processed during the practice 
trials. These indicators capture the two major elements of strategic 
performance: perimeter defense and contact prioritization. Past 
cross-sectional research supports the two-factor structure for the 
performance data using TANDEM (e.g., Bell & Kozlowski, 2002). 


Procedure 


Training was conducted in a single 3-hr session with groups of 
one to four participants. During this session, participants learned to 
operate the radar-tracking simulation described above. Participants 
were randomly assigned to one of three experimental conditions: 
controlling guidance, autonomy-supportive guidance, or a no- 
guidance control condition. 

Familiarization. Trainees were first presented with a brief 
demonstration of the simulation that described its features and 
decision rules and were shown the online instruction manual that 
contained complete information on all important aspects of the 
simulation. They then had an opportunity to familiarize themselves 
with the instruction manual for 3 min and were able to practice the 
task in a 5-min “familiarization” trial. The goal of this preliminary 
trial was to ensure that participants understood how to operate the 
instruction manual and were familiar with the equipment. 

Training. After the familiarization trial, trainees began the training 
session, which was divided into nine 10.5-min trials. Each training trial 
consisted of a cycle of study, practice, and feedback. Trainees had 3 
min to study the online instruction manual. They then had 5 min of 
hands-on practice. The nine trials possessed the same general 
profile (e.g., same difficulty level, rules, number of contacts), but 
the configuration of contacts (e.g., location and characteristics of 
contacts) was unique to each trial. Immediately after each practice 
trial, trainees reviewed veridical descriptive feedback on all as- 
pects of the task relevant to both basic and strategic performance. 
Trainees in all conditions received feedback, but only trainees in 
the guidance conditions received the adaptive guidance informa- 
tion following the last screen of feedback in each trial. Trainees in 
all conditions were given the same amount of time (2.5 min) after 
each practice trial to review their feedback and, if available, 
guidance information. Participants were given a 5-min break fol- 
lowing the third and ninth trials. 


Manipulation Checks 


At the end of training, all participants responded to a three-item 
measure of autonomous motivation adapted from Vansteenkiste et 
al. (2004). The items were assessed on a 5-point Likert scale that
ranged from 1 (not at all true) to 5 (very true). A sample item is
“I practiced the task because it was very interesting.” The reliability
(coefficient alpha) of the measure was .93. We ran a hierarchical
regression analysis, controlling for participants’ pretraining
motivation, to determine whether there were differences across the
three conditions on the measure of autonomous motivation. We
used one-tailed tests of significance due to the directional nature of
our predictions. As expected, participants in the controlling condition
(M = 2.73, SD = 1.21) reported significantly lower levels
of autonomous motivation than participants in the no-guidance
condition (M = 3.28, SD = 1.19), t(129) = −2.19, p < .05, and
marginally significantly lower levels of autonomous motivation than
participants in the autonomy-supportive condition (M = 3.01,
SD = 1.11), t(129) = −1.46, p < .10. Autonomous motivation did
not differ significantly across the autonomy-supportive and
no-guidance conditions, t(129) = 0.92, p > .10, which is consistent
with the fact that participants in both conditions were told they
could choose what to study and practice.


² Pretraining motivation was assessed with eight items adapted from Noe
and Schmitt (1986). Prior to modeling the latent growth trajectories, we
conducted an exploratory factor analysis for the scales. One reverse-coded
item, “My primary goal for this experiment is just to finish it so I get my
credit,” yielded loadings less than .20 on the pretraining motivation factor.
Thus, this item was dropped from the measure. The utility of reverse-coded
items is frequently debated among psychometric scholars (Hinkin, 1998).
In addition to internal item quality issues, dropping the item is also justified
on the basis of judgmental item quality concerns, given that the measure
was adapted to the context and the item may have had different meaning
with the respondent population (see Stanton, Sinar, Balzer, & Smith, 2002).

Given the subtle nature of the manipulation, we also examined 
the amount of time participants spent in the feedback sessions. 
Following each trial, participants could spend up to 2.5 min 
reviewing their feedback and, if available, guidance information. 
Participants in the no-guidance condition received only feedback, 
whereas participants in the controlling and autonomy-supportive 
conditions received both feedback and adaptive guidance informa- 
tion. Thus, if participants in the controlling and autonomy- 
supportive conditions reviewed the guidance information, we 
would expect them to spend more time overall in the feedback 
sessions. The amount of time (in seconds) participants spent re- 
viewing the pages containing feedback and guidance (if available) 
information across the nine trials was automatically recorded by 
the computer and was subjected to regression analysis, once again 
using one-tailed tests of significance. The results revealed that 
participants in the autonomy-supportive condition (M = 616.10, 
SD = 18.90) spent significantly more time in the feedback sessions 
than participants in the control condition (M = 418.77, SD = 
23.26), t(129) = 6.58, p < .01, as did participants in the control- 
ling condition (M = 600.33, SD = 21.23), t(129) = 5.77, p < .01. 
Time spent in the feedback sessions did not significantly differ
across the two guidance conditions, t(129) = −0.55, p > .10.
Furthermore, analyses examining time spent on only the pages
containing feedback information revealed that participants in the
autonomy-supportive condition (M = 378.47, SD = 14.37) spent
significantly less time than participants in the no-guidance condition
reviewing feedback (M = 418.77, SD = 17.69), t(129) =
−1.77, p < .05, as did participants in the controlling condition
(M = 356.41, SD = 16.14), t(129) = −2.61, p < .01. The two
guidance conditions did not significantly differ in amount of time
spent reviewing feedback, t(129) = −1.02, p > .10. Together,
these findings show that participants in the guidance conditions 
spent more time in the feedback sessions, and this increase was due 
to the time they spent reviewing the guidance, rather than feed- 
back, information. 


Analyses 


We used latent growth curve analysis (LCA) to analyze the 
repeated measures performance data. LCA is an extension of 
covariance structure analysis that invokes a confirmatory factor 
analytic structure on the repeated variables measured over time,
where the factor loadings for the latent growth constructs deter- 
mine the shape of the growth trajectories. This approach can give 
identical results to other growth modeling approaches (e.g., hierarchical
linear modeling) but allows greater flexibility (Curran,
2003). In particular, the latent growth curve framework allowed us 
to (a) test measurement invariance assumptions across time and (b) 
estimate growth across the three experimental conditions simulta- 
neously by specifying a multiple-group growth curve model. Hy- 
potheses were tested by sequentially imposing constraints on latent 
means (Hypotheses 1, 2, and 3) and structural paths (Hypotheses 4
and 5) and comparing nested models with the chi-square difference
test (Bentler & Bonett, 1980). Mplus was used to conduct all
analyses (Muthén & Muthén, 2007). Performance measures were 
standardized across the nine trials. For all models, we specified 
autocorrelated error terms for performance scores at each time 
period because scores at adjacent time periods were nonindepen- 
dent. 
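
To make concrete how the factor loadings determine the shape of the growth trajectories, the following sketch builds the fixed loading matrix for a quadratic growth model and computes a trajectory from one set of latent factor scores. It assumes the nine trials are coded 0 through 8 with equal spacing. That coding and the function names are illustrative assumptions, not details reported for this analysis.

```python
from typing import List, Sequence


def growth_loadings(n_trials: int) -> List[List[float]]:
    """Fixed loadings for each trial on the intercept, linear, and quadratic factors.

    Assumes time is coded 0, 1, ..., n_trials - 1.
    """
    return [[1.0, float(t), float(t * t)] for t in range(n_trials)]


def implied_scores(loadings: Sequence[Sequence[float]],
                   factors: Sequence[float]) -> List[float]:
    """Score implied at each trial: the sum of loading times factor score."""
    return [sum(l * f for l, f in zip(row, factors)) for row in loadings]
```

For example, `implied_scores(growth_loadings(9), (intercept, linear, quadratic))` gives the nine error-free trial scores implied by one participant's (or one group's mean) latent growth factors.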


Results 


Table 1 reports descriptive statistics and intercorrelations among 
the study variables. Inspection of the means for the basic and 
strategic performance outcomes shows that participants improved 
over time, but at a decreasing rate. Table 2 presents the basic and 
strategic task performance means for each condition for each of the 
nine training trials. 


Nature of Performance Trajectories 


The first step in LCA is to describe the nature of change for all 
participants in the sample. Table 3 presents fit statistics and nested 
comparisons for alternate growth trajectories (i.e., no growth, 
linear, and quadratic growth) and error structures (i.e., homoge- 
neous or heterogeneous) for basic and strategic performance. The 
no-growth model included only a latent intercept mean and error 
term, whereas additional mean and error terms are included in the 
linear (i.e., intercept and linear terms) and quadratic (intercept, 
linear, and quadratic terms) models. Consistent with other longi- 
tudinal research on learning and performance during skill acqui- 
sition (e.g., Chen & Mathieu, 2008), the nested models in Table 3 
show that the quadratic growth specification best fit the longitu- 
dinal data. 
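
To make the fit indices in Table 3 more concrete, the sketch below recomputes RMSEA from a model's chi-square and degrees of freedom. It assumes the common single-group point estimate, sqrt(max(χ² − df, 0) / (df × (N − 1))), with N = 130 taken from the note to Table 5; the software used in the study may apply a slightly different variant (for example, N instead of N − 1), so agreement with the tabled values is approximate.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of the root-mean-square error of approximation."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


if __name__ == "__main__":
    n = 130  # sample size reported in the note to Table 5
    # Basic performance, quadratic heteroscedastic model: chi-square = 47.27, df = 28
    print(round(rmsea(47.27, 28, n), 2))
    # Basic performance, linear homoscedastic model: chi-square = 151.96, df = 40
    print(round(rmsea(151.96, 40, n), 2))
```

Under these assumptions the two calls reproduce the RMSEA values listed for those models in Table 3 (0.07 and 0.15).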

Table 4 presents the parameter estimates for the quadratic 
growth curve models. The latent factor means describe the average 
shape of performance growth across the nine trials for all participants.
The positive linear factor means for basic (μ = 0.31, t = 
9.52, p < .001) and strategic (μ = 0.34, t = 11.17, p < .001) 
performance suggest that, on average, participants scored 0.31 and 
0.34 standardized points higher in each subsequent performance 
trial for basic and strategic performance, respectively. However, 
the significant negative quadratic factor means for basic (μ = 
−0.02, t = −4.44, p < .01) and strategic (μ = −0.02, t = −5.33, 
p < .001) performance suggest that the marginal rates of performance
improvement were declining over time. Importantly, Table 
4 also shows significant variation around the intercept, linear, and 
quadratic factors. Thus, we next specified conditional latent curve 
models in order to predict the individual-level variation in perfor- 
mance trajectories and test the study hypotheses. 
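
The decelerating pattern described above can be illustrated with a short calculation. The sketch assumes that time is coded 0 through 8 across the nine trials (the coding is not stated in this section) and uses the rounded basic performance factor means from the text (linear = 0.31, quadratic = −0.02). The helper name `trajectory_gain` is hypothetical.

```python
LINEAR = 0.31      # linear factor mean for basic performance (from the text)
QUADRATIC = -0.02  # quadratic factor mean for basic performance (from the text)


def trajectory_gain(t, linear=LINEAR, quadratic=QUADRATIC):
    """Model-implied change in mean performance from time code t to t + 1.

    With y(t) = intercept + linear * t + quadratic * t**2, the intercept cancels:
    y(t + 1) - y(t) = linear + quadratic * (2 * t + 1).
    """
    return linear + quadratic * (2 * t + 1)


# Eight trial-to-trial gains for nine trials; the time coding 0..8 is an assumption.
gains = [round(trajectory_gain(t), 2) for t in range(8)]

if __name__ == "__main__":
    for t, gain in enumerate(gains):
        print(f"Trial {t + 1} -> Trial {t + 2}: {gain:+.2f}")
```

Under this coding each successive gain is 0.04 smaller than the one before it, so improvement continues across all nine trials but at a steadily shrinking rate.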
Table 2 
Means and Standard Deviations for Performance Dimensions Across Time and Experimental Conditions 

                              Autonomy-supportive guidance   Controlling guidance           No guidance
Variable                      n    M            SD           n    M            SD           n    M            SD
Basic task performance
  Time 1                      53   [illegible]  0.65         42   −0.93        0.60         35   −0.90        0.54
  Time 2                      53   −0.55        0.87         42   −0.43        0.78         35   −0.58        0.70
  Time 3                      53   −0.31        0.88         42   −0.08        0.95         35   −0.44        0.89
  Time 4                      53   [illegible]  0.97         42   −0.02        0.87         35   −0.24        0.95
  Time 5                      53   0.02         1.00         42   0.14         0.90         35   0.01         0.81
  Time 6                      53   0.21         0.95         42   0.36         0.89         35   0.23         0.80
  Time 7                      53   0.36         1.07         42   0.40         0.91         35   0.49         0.82
  Time 8                      53   0.47         1.08         42   0.54         0.98         35   0.44         0.84
  Time 9                      53   0.61         0.89         42   0.66         0.87         35   0.49         1.03
Strategic task performance
  Time 1                      53   [illegible]  0.55         42   −0.85        0.41         35   −0.95        0.51
  Time 2                      53   −0.65        0.42         42   −0.49        0.50         35   −0.68        0.46
  Time 3                      53   [illegible]  0.76         42   −0.27        0.70         35   −0.43        0.61
  Time 4                      53   −0.23        0.81         42   0.20         0.94         35   −0.33        0.57
  Time 5                      53   0.05         0.85         42   0.43         0.91         35   [illegible]  0.65
  Time 6                      53   0.27         0.96         42   0.76         0.98         35   −0.03        0.77
  Time 7                      53   0.34         1.02         42   0.96         0.94         35   −0.05        0.89
  Time 8                      53   0.50         1.01         42   1.10         1.02         35   0.18         0.83
  Time 9                      53   0.49         0.86         42   [illegible]  1.03         35   0.27         0.90

Note. Items are standardized. Means with different subscripts are different at p < .05. 

growth than participants receiving controlling guidance. Figure 2 
shows that, contrary to our prediction, participants 
receiving controlling guidance exhibited marginally more positive 
linear basic performance trajectories (μ = 0.38) than participants 
receiving autonomy-supportive guidance (μ = 0.28; Δχ² = 3.51, 
Δdf = 1, p < .10). Hypothesis 2 was not supported. However, 
because the quadratic factor means were negative, indicating a 
decelerating trend, a negative relationship between a predictor and 
a quadratic growth factor suggests that higher levels of a predictor 
are associated with less deceleration in performance over time. The 
quadratic factor mean for participants receiving autonomy-supportive
guidance (μ = −0.01) was marginally less negative 
than for participants receiving controlling guidance (μ = −0.02; 
Δχ² = 3.18, Δdf = 1, p < .10), suggesting that learners receiving 
autonomy-supportive guidance improved in their basic task skills 
at a more consistent rate than did participants receiving controlling 
guidance. Figure 2 shows that the basic performance differences 
between participants receiving autonomy-supportive guidance and 
controlling guidance become smaller over time. 

Hypothesis 3 predicted that participants receiving controlling guidance
would exhibit greater strategic performance growth than participants
receiving autonomy-supportive guidance. Figure 2 (lower figure)
shows that, as expected, participants receiving controlling 
guidance (μ = 0.44) showed greater linear growth in strategic performance
than participants receiving autonomy-supportive guidance 
(μ = 0.32; Δχ² = 8.79, Δdf = 1, p < .01). The strategic quadratic 
factors for controlling guidance (μ = −0.02) and autonomy-supportive
guidance (μ = −0.02) were not different (Δχ² = 0.42, 
Δdf = 1, ns). Thus, Hypothesis 3 was supported. 


Table 3 
Fit Statistics for Intraindividual Growth Trajectories 

Model                         χ²           df    CFI    TLI    RMSEA   SRMR
Basic performance
  No-growth heteroscedastic   455.37***    35    0.56   0.55   0.30    0.84
  No-growth homoscedastic     605.09***    43    0.41   0.51   0.32    0.41
  Linear heteroscedastic      123.97       32    0.90   0.89   0.15    0.17
  Linear homoscedastic        151.96***    40    0.88   0.90   0.15    0.16
  Quadratic heteroscedastic   47.27***     28    0.98   0.97   0.07    0.06
  Quadratic homoscedastic     [illegible]  36    0.96   0.96   0.10    0.09
Strategic performance
  No-growth heteroscedastic   926.92***    37    0.00   −0.01  0.43    2.92
  No-growth homoscedastic     [illegible]  43    0.20   0.33   0.35    0.76
  Linear heteroscedastic      178.06***    34    0.83   0.82   0.18    0.15
  Linear homoscedastic        174.99***    40    0.84   0.86   0.16    0.19
  Quadratic heteroscedastic   46.25***     30    0.98   0.98   0.07    0.08
  Quadratic homoscedastic     94.89***     36    0.93   0.93   0.11    0.12

Note. The degrees of freedom are different between some basic and strategic performance models. This was necessary because we found the strategic 
performance models with heteroscedastic error specifications arrived at improper solutions with negative uniqueness estimates for performances at Trial 
9. Given the small sample size, we followed the recommendations of Gerbing and Anderson (1987) and fixed this residual to zero, which has minimal 
practical influence on parameter estimates or fit statistics. CFI = comparative fit index; TLI = Tucker-Lewis index; RMSEA = root-mean-square error 
of approximation; SRMR = standardized root-mean-square residual. Boldface type indicates best fitting models. 


Table 4 
Growth Curve Parameters for the Quadratic Models 

                            Basic performance         Strategic performance
Growth parameter            Parameter   t             Parameter   t
Intercept
  Mean                      −0.89       −16.99        −0.95       −23.73
  Variance                  0.20        [illegible]   0.11        1.99*
Linear
  Mean                      0.31        9.52***       0.34        11.17***
  Variance                  0.09        4.50          0.09        4.86
Quadratic
  Mean                      −0.02       −4.44**       −0.02       −5.33***
  Variance                  0.00        [illegible]   0.00        [illegible]
Covariances
  Intercept with linear     −0.03       −0.87         −0.02       [illegible]
  Intercept with quadratic  0.00        1.00          0.00        1.14

Hypothesis 4 predicted that pretraining motivation would be positively
related to basic performance growth for participants receiving 
autonomy-supportive guidance, and that this relationship would be weaker 
for participants receiving controlling guidance. Table 5 shows that 
participants’ pretraining motivation was positively related to basic 
linear growth in the autonomy-supportive guidance condition (β = 
.27, EST/SE = 2.32, p < .05) and negatively related to performance 
growth in the controlling guidance condition (β = −.28, EST/SE = 
−2.21, p < .05). This difference was significant (Δχ² = 15.84, Δdf = 
1, p < .05) and is illustrated in Figure 3, where we plotted the 
interactive effects following Aiken and West’s (1991) procedures. 
Table 5 also shows that participants’ pretraining motivation was more 
strongly and negatively related to quadratic change (i.e., deceleration) 
for participants receiving autonomy-supportive guidance (β = −0.03, 
EST/SE = −2.16, p < .05) than for participants receiving controlling 
guidance (β = 0.02, EST/SE = 1.80, p < .10; Δχ² = 13.75, Δdf = 
1, p < .05). This suggests that participants receiving autonomy-supportive
guidance with greater pretraining motivation were able to 
improve their basic performance scores at a more constant rate 
throughout the training. Finally, as expected, Table 5 shows that 
participants’ pretraining motivation was not significantly related to 
strategic performance growth in either guidance condition. These 
results support Hypothesis 4. 
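
The interaction plotted in Figure 3 can be approximated from the estimates reported in the paragraph above. The sketch below is a simple-slopes calculation in the spirit of Aiken and West (1991), not the authors' plotting code. It assumes that pretraining motivation is centered and scaled so that one unit equals 1 SD, and that each condition's linear factor mean is the expected slope at average motivation; neither assumption is confirmed in this section. The dictionary and function names are hypothetical.

```python
# Estimates for the basic performance linear growth factor, as reported in the text.
LINEAR_MEAN = {"autonomy_supportive": 0.28, "controlling": 0.38}
MOTIVATION_PATH = {"autonomy_supportive": 0.27, "controlling": -0.28}


def implied_linear_slope(condition, motivation_z):
    """Expected linear growth for a learner motivation_z SDs from the mean.

    Assumes a centered predictor with SD = 1 (an illustrative assumption).
    """
    return LINEAR_MEAN[condition] + MOTIVATION_PATH[condition] * motivation_z


if __name__ == "__main__":
    for z in (-1.0, 1.0):
        for condition in LINEAR_MEAN:
            slope = implied_linear_slope(condition, z)
            print(f"motivation {z:+.0f} SD, {condition}: {slope:.2f}")
```

Under these assumptions, highly motivated learners (+1 SD) show steeper implied growth with autonomy-supportive guidance (0.55 vs. 0.10), whereas learners low in motivation (−1 SD) show steeper growth with controlling guidance (0.66 vs. 0.01).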

Hypothesis 5 predicted that cognitive ability would be positively 
related to strategic performance growth for participants receiving 
controlling guidance, and that this effect would be weaker for participants 
receiving autonomy-supportive guidance. Table 5 shows that ability
was positively related to linear strategic performance growth 
for participants receiving controlling guidance (β = 0.14, 
EST/SE = 2.30, p < .05) but negatively and not significantly 
related to performance for participants receiving autonomy-supportive
guidance (β = −0.05, EST/SE = −0.60, ns). The 
structural paths between ability and the linear growth factors were 
marginally different across the experimental conditions (Δχ² = 
3.48, Δdf = 1, p < .10). To help interpret the interaction effects 
across guidance conditions, we plotted the interactions using Aiken
and West’s (1991) procedures (see Figure 4), using one 
standard deviation differences in participants’ ability. There was 
also a marginally significant negative relationship between ability 
and the strategic performance quadratic factor for participants 
receiving controlling guidance (β = −0.01, EST/SE = −1.89, p < 
.10), suggesting that higher ability participants receiving controlling
guidance were better able to sustain positive gains in strategic 
performance throughout the nine trials (see Figure 4). Ability was 
not related to the quadratic factor for participants receiving 
autonomy-supportive guidance (β = .00, EST/SE = 0.16, ns), and 
the two guidance conditions did not differ in the effect of 
ability on quadratic change (Δχ² = 1.88, Δdf = 1, ns). Finally, as 
expected, Table 5 shows that ability was not significantly related 
to participants’ basic performance growth in either guidance condition.
Overall, these results provide support for Hypothesis 5. 


Table 5 
Parameter Estimates Across Experimental Conditions 

Variable                                          AG            CG            NG
Basic performance
  Means
    Intercept                                     −0.88         [illegible]   [illegible]
    Linear                                        0.28          0.38          0.30
    Quadratic                                     −0.01         −0.02*        −0.02*
  Structural paths
    Pretraining motivation → Basic intercept      −0.01         0.07          0.02
    Pretraining motivation → Basic linear         0.27*         −0.28*        0.07
    Pretraining motivation → Basic quadratic      −0.03*        0.02†         −0.01
    Ability → Basic intercept                     [illegible]   [illegible]   0.13
    Ability → Basic linear                        −0.03         0.07          −0.04
    Ability → Basic quadratic                     0.00          −0.01         0.00
Strategic performance
  Means
    Intercept                                     −0.99         −0.88         [illegible]
    Linear                                        0.32          0.44          0.26
    Quadratic                                     −0.02*        −0.02*        −0.02*
  Structural paths
    Pretraining motivation → Strategic intercept  −0.22         0.12          −0.04
    Pretraining motivation → Strategic linear     [illegible]   −0.15         0.06
    Pretraining motivation → Strategic quadratic  −0.01         [illegible]   0.01
    Ability → Strategic intercept                 0.13          0.03          0.01
    Ability → Strategic linear                    −0.05         0.14*         0.04
    Ability → Strategic quadratic                 0.00          −0.01†        0.00

Note. Basic performance model: χ²(171, N = 130) = 238.63, CFI = 
0.941, TLI = 0.932, RMSEA = 0.096, SRMR = .095; Strategic performance
model: χ²(176, N = 130) = 226.96, CFI = 0.946, TLI = 0.939, 
RMSEA = 0.082, SRMR = .114. Modification indices suggested correlating
Performance Trial 2 and 4 residuals in the NG basic model. Performance
Trial 4 occurred immediately following a short break and may 
reasonably have impacted participants not receiving structured guidance. 
This change improved fit (Δdf = 1, Δχ² = 13.71, p < .01) in the basic 
model, but did not change any results in this study. AG = autonomy-supportive
guidance; CG = controlling guidance; NG = no guidance. 
CFI = comparative fit index; TLI = Tucker-Lewis index; RMSEA = 
root-mean-square error of approximation; SRMR = standardized root-mean-square
residual. Predicted relationships are bolded. 
Figure 2. Mean basic and strategic performance trajectories across experimental conditions. A-S Guidance = 
Autonomy-Supportive Guidance; C Guidance = Controlling Guidance. 


Discussion 


Although educational institutions and work organizations are 
increasingly using computers to deliver instruction, learners 
often do not make good use of the control inherent in modern 
learning technologies (Brown, 2001). Prior research suggests 
that adaptive guidance can assist learners in making more 
effective learning choices and can enhance learning outcomes 
in technology-based instruction (Bell & Kozlowski, 2002). The 
current investigation provides further support for the utility of 
adaptive guidance, but more importantly it advances research in 
this area by showing that the effects of guidance may vary 
across different design features, learning outcomes, and learner 
profiles. In the following sections, we review the key findings 
of the current study and discuss their theoretical and practical 
implications. 


Figure 3. Influence of pretraining motivation on basic performance trajectories across experimental conditions. A-S Guidance = Autonomy-Supportive Guidance; C Guidance = Controlling Guidance. 


Figure 4. Influence of ability on strategic performance trajectories across experimental conditions. A-S Guidance = Autonomy-Supportive Guidance; C Guidance = Controlling Guidance. 


Key Findings and Theoretical Implications 


Prior research on adaptive guidance has tended to treat its 
effects on learning as static. To address this limitation, we used a 
longitudinal design and LCA to examine the effects of adaptive 
guidance on learning over time. The results revealed that learners 
who received adaptive guidance exhibited more positive change in 
their task performance over time than those who received no 
guidance, but this effect was limited to the effects of controlling 
guidance on strategic task performance. Adaptive guidance is 
designed primarily to impact the quality of learning (Bell & 
Kozlowski, 2002), so it is not surprising that its effects would be 
most pronounced for strategic performance outcomes, which are 
closely tied to processes related to learners’ attention and require 
the integration of concepts and the development of task strategies. 
Furthermore, although we expected that both autonomy-supportive 
and controlling guidance would lead to more positive strategic task 
performance change than no guidance, the observed pattern of 
findings supports the argument that increasing the level of direction 
and constraining learner choices may enhance strategic learning 
outcomes by reducing demands on learners’ attentional resources 
and making it more likely that learners will come into contact with 
critical to-be-learned material (Kirschner et al., 2006; Mayer, 
2004). 

The direct comparison of autonomy-supportive and controlling 
guidance provided further evidence for the superiority of control- 
ling guidance in the current context. As expected, individuals 
receiving controlling guidance exhibited greater linear growth in 
their strategic task performance than those who received 
autonomy-supportive guidance. Contrary to our predictions, indi- 
viduals who received controlling guidance also exhibited margin- 
ally more positive basic task performance trajectories than those 
receiving autonomy-supportive guidance. It is important to note, 
however, that the basic performance trajectories of those in the 
controlling guidance condition showed a trend toward greater 
deceleration in performance growth than those in the autonomy- 
supportive guidance condition (see Figure 2). Thus, future research 
may investigate these findings further to determine whether guid- 
ance that emphasizes autonomy and choice may lead to higher 
levels of basic performance when learning is extended over a 
longer time frame, perhaps by sustaining individuals’ motivation 
and effort (e.g., Moller, Deci, & Ryan, 2006). Overall, however, 
these findings suggest that controlling guidance may be a more 
effective strategy for supporting skill development in more com- 
plex learning environments. Future research is needed to replicate 
and extend these findings, with particular attention devoted to 
examining the learning processes that may help further elucidate 
the effects of different guidance designs on various learning tasks. 

A final issue examined in the current study was the interactive 
effects of learner characteristics and guidance design on learning 
over time. Drawing on SDT and resource allocation theory, we 
argued that individual differences related to effort (pretraining 
motivation) and the availability of attentional resources (cognitive 
ability) may interact with autonomy-supportive and controlling 
guidance, respectively, to influence learning trajectories. As ex- 
pected, the results revealed that individuals with high levels of 
pretraining motivation exhibited greater growth in basic task per- 
formance when given autonomy-supportive rather than controlling 
guidance. Controlling guidance was detrimental to the basic task 
performance growth of individuals with high levels of motivation 
(see Figure 3), but interestingly it enhanced the performance of 
individuals with low levels of pretraining motivation (a finding we 
discuss more below). Overall, these findings suggest that 
autonomy-supportive guidance may support the natural expression 
of high levels of learning motivation, whereas controlling guidance 
may be effective for inducing effort from those trainees who have 
less positive initial attitudes and intentions toward training. 

We also found that ability interacted with guidance design to 
impact strategic task performance. Among those who received 
controlling guidance, there was a positive relationship between 
ability and strategic performance growth. These findings support 
our argument that controlling guidance enables learners to allocate 
more of their attentional resources toward study and practice 
activities that will allow them to master complex task elements. 
However, when individuals received autonomy-supportive guidance,
ability was unrelated to strategic performance. This is consistent
with our hypothesis that increasing learner choice options 
may absorb or divert attentional resources that could otherwise be 
directed toward skill acquisition. 


Practical Implications 


The current study suggests that the advantage of 
autonomy-supportive instructional designs over controlling 
designs may be limited in more complex tasks, and motivational 
guidelines alone are not sufficient for instructional design. Instead, 
designers should consider the extent to which the instructional 
program aims to teach basic or strategic skills. For basic task 
performance, autonomy-supportive guidance had an advantage 
over controlling guidance, but only for learners who possessed 
high levels of pretraining motivation (i.e., learners 1 SD above the 
mean; see Figure 3). This is consistent with our argument that 
autonomy-supportive learning contexts facilitate, and controlling 
contexts thwart, the beneficial effects of pretraining motivation. 

Although controlling instructional designs are less frequently 
advocated, the current study showed a clear advantage for control- 
ling guidance over autonomy-supportive guidance for strategic 
skill acquisition. Learners receiving controlling guidance showed 
greater gains in strategic performance than participants receiving 
either autonomy-supportive guidance or no guidance (see Figure 
2). Furthermore, controlling guidance enhanced the positive rela- 
tionship between cognitive ability and strategic performance, 
whereas cognitive ability was not significantly related to perfor- 
mance improvements for those receiving autonomy-supportive 
guidance. Although unexpected, the greatest growth in basic performance
was observed among participants low in 
pretraining motivation who were given controlling guidance instructions.
Together, these findings provide several examples of 
the potential utility of guidance instructions that are controlling 
instead of autonomy supportive. 

Katz and Assor (2007) pointed out that SDT is a theory of three 
human needs—autonomy, competence, and relatedness. Providing 
choice can have implications for learning if it changes the extent to 
which any of these needs are or are not satisfied. Katz and Assor 
(2007) noted the potential resource limitations associated with 
providing learners with autonomy during complex tasks and sug- 
gested that instructional designers might reduce the complexity of 
the task to match a person’s cognitive ability. On complex tasks, 
learners’ need for competence may be more salient than their need 
for autonomy. In the current study, controlling guidance informa- 
tion may have helped to conserve attentional resources that could 
be directed to learning important material, thus supporting learn- 
ers’ need for competence. Future research can investigate whether 
tailoring guidance to different needs (autonomy, relatedness, and 
competence) can enhance the beneficial effects of adaptive guid- 
ance on learning and performance across different learning con- 
texts. For example, Katz and Assor (2007) noted that providing 
choice to teams can impact relatedness needs. 


Limitations and Future Research Directions 


It is important to highlight a few limitations to the current 
research. First, the synthetic task and student sample may limit the 
generalizability of our findings. Future research should extend our 
findings to different tasks, training spanning different lengths of 
time, and different instructional aids (e.g., intelligent tutors). Fur- 
thermore, future research should examine the relationships in other 
samples with varying levels of motivation and ability. For example,
future research extending our findings in a field study using a 
sample varying on demographic and individual-difference factors 
(e.g., age) associated with different levels of motivation and ability 
would have important practical implications. Alternatively, re- 
searchers could attempt to manipulate attentional resources in an 
experimental study by varying the task demands across perfor- 
mance trials. 

Second, Figure 3 reveals a negative relationship between moti- 
vation and basic performance for learners in the controlling guid- 
ance condition, which implies that the least motivated participants 
were acquiring basic skills at the fastest rate. This was an unex- 
pected finding and suggests that controlling guidance did not 
thwart the positive effects of motivation on basic performance 
acquisition, but reversed the motivational effect (i.e., it was ben- 
eficial for unmotivated learners). We speculate that learners who 
lack the motivation to engage in study decisions may have de- 
faulted to compliance, whereas learners with moderate levels of 
motivation may have reached a level of motivation that was 
sufficient to channel attentional resources away from the task. 
Future research is needed to first replicate and then extend this 
finding. Building on this finding, research may also consider other 
situations in which external control is preferable to intrinsic mo- 
tivation to learn (cf. Pintrich, 2003). 

In addition, our study design did not allow us to examine 
attrition from the training, which is an important practical problem 
for learner-control instructional designs (Sitzmann & Ely, 2010). 
This is an important consideration because scholars have found 
that controlling instructional designs may be associated with less 
task persistence than autonomy designs (e.g., Vansteenkiste et al., 
2004). Therefore, it is important that future research include mea- 
sures of attrition. Future research examining the impact of design 
features on attrition may also benefit by examining the type of 
motivation induced by the design features. For example, control- 
ling guidance instructions may facilitate motivation that is intro- 
jected (e.g., internal control such as avoiding guilt) or external 
(compliance, satisfying external demands), and the difference may 
be important for measures of persistence. Finally, future research 
may want to examine instructional designs that shift the focus of 
guidance over time. For example, guidance designs that shift from 
controlling to autonomy supportive as training progresses may 
facilitate the acquisition of complex skills while also sustaining 
learners’ motivation and effort over extended time frames. 


Conclusion 


A central issue facing learner-controlled educational technolo- 
gies is that learners often make poor use of the control they are 
given. Thus, instructional strategies such as adaptive guidance aim 
to help learners to better use the control by facilitating key moti- 
vational (e.g., effort) and cognitive (e.g., learning choices) pro- 
cesses. This article suggests that slight changes in the design of 
adaptive guidance interact with individual differences in pretrain- 
ing motivation and cognitive ability to impact the rate at which 
learners acquire basic and strategic task skills. Specifically, guidance
that was autonomy supportive appeared to facilitate (while 
controlling guidance reversed) the positive effects of pretraining 
motivation during basic skill acquisition. Guidance that was con- 
trolling was better for learning strategic skills, and appeared to 
facilitate the positive effects of cognitive ability on strategic skill 
acquisition. In contrast, when learners received guidance that was 
autonomy supportive, higher cognitive ability was not significantly 
related to the acquisition of strategic task skills. These findings 
highlight the importance of aligning the guidance design, individual
differences, and skill outcome in learner-controlled environments.


A Selective Meta-Analysis on the Relative Incidence of Discrete Affective 
States During Learning With Technology 


Sidney D’Mello 


University of Notre Dame 


The last decade has witnessed considerable interest in the investigation of the affective dimensions of 
learning and in the development of advanced learning technologies that automatically detect and respond 
to student affect. Identifying the affective states that students experience in technology-enhanced learning 
contexts is a fundamental question in this area. This article provides an initial attempt to answer this 
question with a selective meta-analysis of 24 studies that utilized a mixture of methodologies (online 
self-reports, online observations, emote-aloud protocols, cued recall) and affect judges (students them- 
selves, untrained peers, trained judges) for fine-grained monitoring of 14 discrete affective states of 1,740 
middle school, high school, college, and adult students in 5 countries. Affective states occurred over the 
course of interactions with a range of learning technologies, including intelligent tutoring systems, 
serious games, simulation environments, and simple computer interfaces. Standardized effect sizes of 
relative frequency, computed by comparing the proportional occurrence of each affective state to the 
other states in each study, were modeled with random-effects models. Engagement/flow was consistently 
found to be relatively frequent (d+ = 2.5), and contempt, anger, disgust, sadness, anxiety, delight, fear, 
and surprise were consistently infrequent, with d+ ranging from −6.5 to −0.78. Effects for boredom 
(d+ = 0.19), confusion (d+ = 0.12), curiosity (d+ = −0.10), happiness (d+ = −0.13), and frustration 
(d+ = −2.5) varied substantially across studies. Mixed-effects models indicated that the source of the 
affect judgments (self vs. observers) and the authenticity of the learning contexts (classroom vs. 
laboratory) accounted for greater heterogeneity than the use of advanced learning technologies and 
training time. Theoretical and applied implications of the findings are discussed. 
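
As a rough illustration of what a random-effects model does with a set of study-level effect sizes, the sketch below pools effects using the DerSimonian-Laird estimator of between-study variance. The abstract does not say which estimator was used, so this is a commonly used stand-in rather than the article's procedure, and the effect sizes and sampling variances in the example are invented purely for demonstration.

```python
def random_effects_pool(effects, variances):
    """Pool effect sizes with DerSimonian-Laird random-effects weights.

    Returns (pooled effect, standard error, between-study variance tau^2).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))  # heterogeneity Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, se, tau2


if __name__ == "__main__":
    # Hypothetical effect sizes and sampling variances for one affective state.
    effects = [0.4, -0.3, 0.9, 0.1]
    variances = [0.04, 0.05, 0.06, 0.04]
    pooled, se, tau2 = random_effects_pool(effects, variances)
    print(f"pooled = {pooled:.2f}, SE = {se:.2f}, tau^2 = {tau2:.2f}")
```

When the studies disagree by more than sampling error would predict, tau² becomes positive, the weights become more equal across studies, and the standard error of the pooled effect widens.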


Keywords: affect, emotion, learning, technology, meta-analysis 


As most students and teachers will attest, learning is an affec- 
tively charged experience. Students experience boredom when the 
material does not appeal to them (low perceived value), when they 
have little or no choice over the learning task, when they cannot 
cope with task demands because challenges outweigh skills, and 
when they are understimulated because skills outweigh challenges 
(Csikszentmihalyi, 1975, 1990; Daschmann, Goetz, & Stupnisky, 
2011; Pekrun, Goetz, Daniels, Stupnisky, & Perry, 2010). Students 
get confused when they have difficulty comprehending the mate- 
rial, when they encounter challenging impasses, and when they are 
unsure about how to proceed (D’Mello & Graesser, in press; 
VanLehn, Siler, Murray, Yamauchi, & Baggett, 2003). Frustration 
occurs when students repeatedly make mistakes, when they get 
stuck, or when important goals are blocked (Kapoor, Burleson, & 
Picard, 2007; Stein & Levine, 1991). Students even experience 
despair and anxiety when their efforts seem futile and when the 
consequence of failure is high (Zeidner, 2007). 

This negative picture of the affective experiences that accom- 
pany learning has a complementary positive side. Students expe- 
rience interest and curiosity when they encounter novelty and 
topics that interest them (Berlyne, 1978; Hidi, 2006; Silvia, 2009), 
eureka moments when insights are unveiled and major discoveries 
are made (Parnes, 1975), delight when challenges are conquered 
and goals are attained (D’Mello & Graesser, 2011), and perhaps 
even flow-like states of intense engagement when there are clear 
learning goals, an appropriate balance between challenges and 
skills, and immediate feedback on actions (Csikszentmihalyi, 
1990). 

These examples provide a sketch of how affective states can 
arise during learning activities. Systematic research focusing on 
the link between affect and learning has been rapidly progressing 
over the last decade in the interdisciplinary arena that encompasses 
the fields of psychology (Deci & Ryan, 2002; Dweck, 2006), 
education (Lepper & Woolverton, 2002; Meyer & Turner, 2006; 
Pekrun, 2010; Schutz & Pekrun, 2007), artificial intelligence in 
education (Calvo & D’Mello, 2011; Conati & Maclaren, 2009; 
Woolf et al., 2010), and, more recently, neuroscience (Immordino- 
Yang & Damasio, 2007). Progress toward uncovering links be- 
tween affect and learning involves both theoretical development 
and empirical research on the affective states, the factors that give 
rise to these states (antecedents), and the impact of affect on the 
processes and products of learning (consequents). 

Some of the theories that have emerged in this area emphasize 
the importance of (a) appraisals of control and value of the learn- 
ing activity (Pekrun, 2010; Pekrun et al., 2010; Pekrun, Goetz, 
Titz, & Perry, 2002), (b) goal orientations (Hulleman, Durik, 
Schweigert, & Harackiewicz, 2008), (c) motivation and mind-set 
(Deci & Ryan, 2002; Dweck, 2006), (d) academic risk taking 
(Clifford, 1988; Meyer & Turner, 2006), (e) interest development 
and maintenance (Ainley, 2008; Hidi, 2006), (f) the state of flow 
(Csikszentmihalyi, 1975, 1990), and (g) impasses, cognitive dis- 
equilibrium, and goal appraisals (Graesser, Lu, Olde, Cooper-Pye, 
& Whitten, 2005; Piaget, 1952; Stein & Levine, 1991). It is beyond 
the scope of this article to discuss each theory (even briefly), but 
it is important to emphasize that the various theories make a 
number of predictions on how affective states arise and influence 
learning outcomes. With the exception of test anxiety, which has 
dominated the scientific inquiry on affect during learning for the 
last several decades (Pekrun et al., 2010; Zeidner, 2007), there is 
a lot of theory and a dearth of data. Hence, several of these 
theories’ hypotheses remain untested, leaving many fundamental 
questions about how affective states arise, morph, decay, and 
impact learning outcomes largely unanswered. 

The last two decades have also witnessed an educational tech- 
nology revolution in the form of advanced learning technologies 
(ALTs) including intelligent tutoring systems (Aleven, McLaren, 
Sewall, & Koedinger, 2009; Beal, Arroyo, Cohen, & Woolf, 2010; 
Graesser, Chipman, Haynes, & Olney, 2005), animations and 
simulations (Ainsworth, 2008; Mayer, 2005), and immersive edu- 
cational games (Johnson & Valente, 2008; Sabourin et al., 2011). 
These systems all aim to positively impact learning by modeling 
student knowledge and engaging students in ways that far exceed 
the capabilities of the computer-assisted learning systems of the 
past (Corbett, 2001; VanLehn, 2011). The enhanced interactivity 
and human-like communication capabilities afforded by the ALTs 
are hypothesized to influence student affect in significant ways. 
Yet, with precious little data at hand, the affective impacts of 
technologically infused learning environments are not very well 
understood. Systematic research focused on answering basic ques- 
tions, such as identifying the specific affective states that students 
experience while interacting with learning technologies and un- 
covering how these affective states influence learning, is still in its 
infancy. 

There is also an engineering side to complement the scientific 
study of affect during learning. It has been suggested that one way 
to increase engagement and learning is to develop ALTs that can 
automatically detect and respond to student affect (du Boulay et 
al., 2010; Lepper & Woolverton, 2002; Picard, 1997). This is 
because it is presumed that affect is not merely incidental to 
learning but can also influence learning outcomes. As an example, 
consider boredom, a state that is negatively correlated with learn- 
ing (Craig, Graesser, Sullins, & Gholson, 2004; Forbes-Riley & 
Litman, 2011b; Pekrun et al., 2010), presumably because bored 
students have trouble focusing their attention and actively persist- 
ing in the learning task. Once boredom emerges, it tends to be 
quite persistent (Baker, D’Mello, Rodrigo, & Graesser, 2010), 
which reduces the likelihood that students will reengage with the 
material. Baker et al. (2011) have shown that off-task behavior can 
alleviate boredom, but off-task behavior itself is detrimental to
learning (Baker, Corbett, Koedinger, & Wagner, 2004). Further- 
more, bored students are more likely to experience frustration 
when they are forced to endure a learning session despite their 
ennui (D’Mello & Graesser, 2012); frustration is another affective 
state that is harmful to learning (Linnenbrink & Pintrich, 2002). 
Current ALTs primarily focus on the cognitive needs of the 
learner, so it might be the case that novel pedagogical and moti- 
vational strategies are required to inspire students to persist in 
learning despite the experience of negative affective states like 
boredom and frustration. Affect-sensitive ALTs are one way to 
achieve this goal. 

In its most basic form, an affect-sensitive or affect-aware ALT 
could automatically sense when a student is bored, confused, 
anxious, frustrated, and so on, and intervene accordingly. Fully 
automated affect sensing uses predictive models that infer student 
affect by analyzing the context of the session and interaction 
profiles of the student (Baker et al., 2012; Conati & Maclaren, 
2009; Sabourin, Mott, & Lester, 2011) and/or diagnostic models 
that sense affect from facial features, speech, postures, gestures, 
central and peripheral physiology, and textual responses (Chaouachi
& Frasson, 2010; Pour, Hussein, AlZoubi, D’Mello, & Calvo,
2010). An affect-sensitive ALT has a number of paths to pursue 
once it has sensed a student’s affective state, although the ideal 
affect-response strategies are most likely tied to aspects of the 
immediate situational context. Some possible interventions include 
doing nothing if the student is engaged and is on a positive 
learning trajectory; providing hints and just-in-time explanations 
when confusion or frustration is detected; and providing choice, 
encouraging breaks, or adjusting levels of challenge with respect to 
difficulty when boredom is detected. These and other affect- 
sensitive interventions have recently been implemented and com- 
pared to nonaffective interventions in ALTs, such as AutoTutor 
(D’Mello et al., 2010), ITSpoke (Forbes-Riley & Litman, 2011a),
and Gaze Tutor (D’Mello, Olney, Williams, & Hays, 2012).
Positive effects of affect sensitivity on learning gains have been
documented in some contexts, but the jury is still out on the 
effectiveness of affect-sensitive ALTs across a range of learning 
environments, subject domains, and student populations. 
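
The response strategies listed above can be pictured as a simple lookup from a sensed state to a candidate intervention. The sketch below is purely illustrative: the state labels, the `choose_intervention` function, and the `on_positive_trajectory` flag are hypothetical names introduced here and are not taken from AutoTutor, ITSpoke, or Gaze Tutor.

```python
from typing import Optional

# Hypothetical mapping from a sensed affective state to candidate responses,
# following the examples given in the text.
INTERVENTIONS = {
    "confusion": ["provide hint", "give just-in-time explanation"],
    "frustration": ["provide hint", "give just-in-time explanation"],
    "boredom": ["offer choice", "encourage break", "adjust challenge level"],
}


def choose_intervention(state: str, on_positive_trajectory: bool) -> Optional[str]:
    """Return a candidate intervention, or None to do nothing."""
    # Do nothing if the student is engaged and on a positive learning trajectory.
    if state == "engagement" and on_positive_trajectory:
        return None
    options = INTERVENTIONS.get(state)
    # States without a listed response also yield no intervention in this sketch.
    return options[0] if options else None
```

A real system would presumably condition the choice on the immediate situational context as well, which this lookup deliberately omits.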

In summary, the recent interest in exploring links between affect 
and learning, coupled with the emergence of affect-sensitive learn- 
ing technologies, raises important questions about the role of affect 
during learning with technology. It is clear that understanding 
which affective states students are more likely to experience in 
technologically rich learning contexts is an important first step 
toward enriching understanding of affect during learning and is an 
essential step toward the development of systems that intelligently 
handle student affect. Unfortunately, available data on this most 
basic issue of identifying the affective states that naturally arise 
during learning with technology are somewhat sparse and scattered 
and are in need of systematic synthesis. This article addresses this 
issue by providing a selective meta-analysis of 24 studies that have 
systematically monitored student affect during interactions with 
both basic and advanced learning technologies. The primary goal 
in this analysis is to assess whether a set of discrete affective states 
can be consistently identified across a diverse set of laboratory and 
classroom studies that vary the learning task, the learning domain, 
the learning technology, the students who are engaging in the 
learning task, and the methodology used to monitor affect. It might 
be the case that some affective states are consistently observed
across a variety of contexts, populations, and methodologies, 
whereas others are more closely coupled to these factors. As such, 
a secondary goal in this analysis is to identify the factors that 
predict the variability in the incidence of the affective states. 

It is important to emphasize that the present goal is not to 
compare affect in technology-infused versus more traditional 
learning contexts (e.g., classrooms or completing homework on 
pencil and paper). Rather, the current focus is to
assess the relative frequency of student affective states during 
learning with technology, determine the consistency of these rel- 
ative frequencies across studies, and identify factors that explain 
variability in the relative frequencies. It should also be noted that 
the present focus is on discrete affect measurement
models (e.g., boredom, anger) instead of dimensional affect mea- 
surement models (e.g., valence and arousal). The use of discrete 
versus dimensional models for affect representation has been an 
ongoing debate in the affective sciences community for over a 
century (Lench, Bench, & Flores, 2013; Lindquist, Siegel, Quig- 
ley, & Barrett, 2013). In our view, studies that focus on either 
affect representational scheme can yield important insights into 
affect and learning; however, discrete models are better poised to 
afford actionable affective response strategies, which is a major 
goal of affect-sensitive ALTs. For example, an ALT that senses
that a student is frustrated can respond more specifically by giving 
hints, displaying empathy, and so on, than an ALT that detects 
negative valence (i.e., dimensional affect) but is unable to deter- 
mine if this is due to anger, confusion, frustration, or any other 
negative affective state. Of course, it is an open question if specific 
responses to each discrete affective state are needed, but this is 
currently the working hypothesis in the emerging field of affect- 
sensitive ALTs. 


Scope, Selection, and Description of Studies 


Scope of Analysis 


To appropriately contextualize this meta-analysis, note that the 
scientific research on affect during learning can be categorized into 
two separate strands of equal importance. These two research 
strands can be distinguished by virtue of scope, learning contexts, 
and methods used to track and analyze affect. The first strand 
focuses on a broad set of academic emotions (Pekrun, 2010), 
which include achievement emotions (e.g., frustration, anxiety), 
social emotions (e.g., pride, jealousy), topic emotions (e.g., empa- 
thy for a protagonist), and epistemic emotions (e.g., confusion and 
surprise). The dominant research methodology involves psycho- 
metrically grounded surveys that tap into a large set of variables 
that are hypothesized to be antecedents of affective states, such as 
achievement goals, situational interest, and self-concept. This line 
of research has yielded invaluable insights on affect and learning 
(see edited books by Schutz and Pekrun, 2007, and Pekrun and 
Linnenbrink-Garcia, in press, for examples) and has inspired the 
research community to probe deeper into the affective dimension 
of learning. 

The second research strand focuses on more in-depth analyses 
of a smaller set of affective states that arise during learning in more 
restricted contexts (e.g., computer labs in schools and laboratory 
studies) and over shorter time spans that range from 30 to 90 
minutes (see edited book by Calvo and D’Mello, 2011, for example
research studies). Most, if not all, of this research focuses on
learning with some form of technology, such as intelligent tutoring 
systems, serious games, simulation environments, and basic com- 
puter interfaces for problem solving, reading, and writing. This 
line of research primarily focuses on the achievement and epis- 
temic emotions and sometimes the topic emotions. Social emotions 
are less relevant because most (but not all) of the research focuses 
on student-computer interactions rather than student-student
interactions. Researchers in this group also use a more varied set of
methodologies to track affect, such as online observations, emote- 
aloud protocols, cued recall, coding of video data, and physiolog- 
ical and behavioral instrumentation (e.g., facial feature tracking, 
galvanic skin response, posture sensors). 

The emphasis in this paper is on this second strand of research 
for a number of reasons. First, much of this research focuses on 
student affect during interactions with learning technologies, 
which is the main focus of this paper. Second, this research 
monitors affect at relatively fine-grained temporal resolutions 
ranging from seconds to minutes throughout a learning session. A 
fine-grained temporal resolution for affect sampling is important 
because coarser grained samples, such as measuring affect before 
and after a learning session, run the risk of overlooking the ebb and 
flow of dynamically changing affective states (Baker, Rodrigo, & 
Xolocotzin, 2007; D’Mello & Graesser, 2012). Monitoring affect 
at a fine-grained temporal resolution has the additional advantage 
of modeling the learning events that occur in close proximity with 
the affective states. For example, frustration after receiving nega- 
tive feedback from a computer tutor is quite different from frus- 
tration due to the poor speech synthesis of an animated pedagog- 
ical agent. Third, this research is characterized by a mixed-method 
approach to measuring affect instead of an exclusive focus on 
self-reports. There is no consensus as to the most accurate method 
to measure affect; hence, a mixed-method approach encompassing 
the students themselves as well as online observers or offline video 
coding represents the most defensible position. 

Another matter of scope pertains to how key terms such as 
learning activities, learning technologies, and affect are construed 
in this analysis. The present paper is quite inclusive in how these 
terms are used. Learning activities can range from text compre- 
hension, problem solving, and argumentative writing to interacting 
with simulations, serious games, and intelligent tutoring systems 
(ITSs). A learning technology is any computer system that serves 
an educational purpose. It can be a complex educational game or 
a simple interface to support self-regulated learning. Affective 
states are also broadly construed and are taken to encompass 
relatively quick (seconds to a few minutes) experiences of both 
bona fide emotions (e.g., anger, fear) and blends of cognition and 
emotion (e.g., confusion, interest, and states of engagement with 
mild positive affect) but not longer term mood states (e.g., depres- 
sion), dispositional affective traits (e.g., hostility), or motivational 
orientations (e.g., mastery approach tendencies). There is adequate 
theoretical justification to support this conceptualization of affect 
(Pekrun, 2010; Rosenberg, 1998; Silvia, 2009). 


Search, Inclusion Criteria, and Power Analysis 


Search. The studies were selected by searching (a) Interna- 
tional Journal of Artificial Intelligence in Education, Journal of 
Educational Psychology, Emotion, and Cognition & Emotion; (b) 
strictly peer-reviewed proceedings of the Intelligent Tutoring Sys- 
tems (ITS), Artificial Intelligence in Education (AIED), and Edu- 
cational Data Mining (EDM) conferences; (c) two edited books on 
emotions and learning (Calvo & D’Mello, 2011; Schutz & Pekrun, 
2007); and (d) the Education Resources Information Center 
(ERIC) database and PsycINFO with queries consisting of the 
terms affect, learning, and technology. An additional search with 
Google Scholar was performed to obtain graduate theses and other
publications not indexed by the major databases. Furthermore, 
some of the noted researchers who study affect and learning within 
the context of learning technologies were contacted for unpub- 
lished manuscripts. 

It should be noted that this informal search strategy was adopted 
because a more formal search (searching major databases with 
keywords) was not yielding suitable results. This might be partially 
due to the infancy of the field but is also likely due to the fact that 
many of the researchers in this area tend to present their work in 
strictly reviewed conferences that include published proceedings 
in the form of edited volumes, which are not always indexed in the 
major databases. There is some confidence that the present infor- 
mal search strategy, which included targeting specific outlets, 
uncovered most of the relevant articles, because the formal ap- 
proach of searching ERIC and PsycINFO rarely uncovered an 
article that was not discovered during the targeted search. 

Inclusion criteria. Twenty-four studies were selected for the 
analysis on the basis of the following criteria related to the learning 
context, method used to monitor affect, affect measurement model, 
sample size, and availability of data. 

With respect to the learning context, the only requirement was 
that the studies should involve interactions with some form of 
learning technology. A broad definition of learning technology 
was adopted, as discussed above. 

The following criteria pertained to the methodology used to 
monitor student affect. There was the requirement that student 
affective states should be tracked during the learning session. 
Studies that monitored affect only before and after a learning 
session were excluded. There was also the requirement that the 
affective states should be measured at a relatively fine-grained 
temporal resolution. Studies that tracked affect two or three times 
during a learning session were excluded because this does not 
afford sufficiently fine-grained monitoring of affect, which is the 
focus of this analysis. Additionally, studies that inferred affective 
states entirely on the basis of physiology, facial activity, EEG, and 
other signals such as posture, gesture, eye gaze, and acoustic 
features (e.g., Chaouachi & Frasson, 2010; Harley, Bouchet, & 
Azevedo, 2012) were excluded due to open issues pertaining
to the validity of fully automated affect measurement systems
(Calvo & D’Mello, 2010; Kappas, 2010; Zeng, Pantic, Roisman, &
Huang, 2009). Finally, studies that tracked a single student across 
multiple learning sessions were not included because between- 
student error estimates, which are required for the statistical anal- 
ysis, are not available in these single-student studies. 

The affect measurement model pertains to the list of affective 
states measured as well as the instruments used to measure these 
states. The present focus was on discrete affective states for the 
reasons noted in the introduction. A majority of the available 
studies focused on discrete affect, so this inclusion criterion did not 
drastically reduce the pool of available studies. In line with this 
requirement, only studies that recorded the presence or absence of
discrete affective states (e.g., boredom, anger, frustration) at mul- 
tiple instances in the learning session were included. Studies that 
employed dimensional affect models, such as valence and arousal 
(e.g., Vuorela & Nummenmaa, 2004), and studies that grouped 
affective states into broad categories, such as positive, neutral, and 
negative (Litman & Forbes-Riley, 2006), were excluded. Studies 
that measured affect via Likert-type scales instead of binary self- 
reports (e.g., Arroyo et al., 2009; Conati & Maclaren, 2009) were 
also excluded because this introduced complications with respect 
to computing the effect size statistic (discussed in the Analysis and 
Results section). Similarly, studies that focused on the presence 
versus absence of a single affective state (e.g., Forbes-Riley & 
Litman, 2009) or on multiple levels of a single affective state, such 
as multiple levels of engagement but not any other state (e.g., Mota 
& Picard, 2003), were not included because this does not permit 
comparisons with other affective states. 

Studies were also selected on the basis of data availability. The 
following four units of information were required by the statistical
approach adopted for the analysis: (a) number of students, (b) 
mean proportional occurrence of each discrete affective state 
across students, (c) standard deviation of proportional occurrence 
of each discrete affective state across students, and (d) correlation 
matrix of proportional occurrence of affective states. Studies were 
excluded if they did not report this information in published 
reports or if the author(s) of the candidate studies did not respond 
to requests to provide the necessary information. 
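
As an illustration of the data-availability criterion, the four required units can be held in a small record that reports which units are missing. This is a minimal sketch under stated assumptions: the `StudyRecord` class, its field names, and the `missing_units` helper are hypothetical and are not part of the original analysis.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class StudyRecord:
    """Summary statistics for one study (field names are hypothetical)."""
    n_students: int                     # (a) number of students
    mean_prop: Dict[str, float]         # (b) mean proportional occurrence per state
    sd_prop: Dict[str, float]           # (c) SD of proportional occurrence per state
    corr: Dict[Tuple[str, str], float]  # (d) correlations keyed by sorted state pairs

    def missing_units(self) -> List[str]:
        """List which of the four required units are absent or incomplete."""
        missing = []
        states = sorted(self.mean_prop)
        if self.n_students <= 0:
            missing.append("n_students")
        if not states:
            missing.append("mean_prop")
        if set(self.sd_prop) != set(states):
            missing.append("sd_prop")
        # Every unordered pair of states needs a correlation entry.
        pairs = [(a, b) for i, a in enumerate(states) for b in states[i + 1:]]
        if any(pair not in self.corr for pair in pairs):
            missing.append("corr")
        return missing
```

Under this sketch, a study would be a candidate for exclusion whenever `missing_units()` returns a non-empty list.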

It is important to point out that the present analysis was intended 
to answer two very specific affect-related questions (see the intro- 
duction) in a very specific learning context (i.e., learning with 
technology). Therefore, very specific publication outlets were tar- 
geted and somewhat stringent inclusion criteria were adopted. The 
relatively informal nature of the search, which makes replication 
difficult, and the fact that only one author performed the search 
and applied the inclusion/exclusion criteria, do not meet recommended
criteria for performing a comprehensive meta-analysis
(Borenstein, Hedges, Higgins, & Rothstein, 2009; Hammerstrøm,
Wade, & Jørgensen, 2010). Hence, as reflected in the title, the
present paper should be considered to be a “selective meta- 
analysis.” 

Power analysis. A power analysis was conducted to deter- 
mine if the sample of 24 studies yielded adequate power to detect 
significant effects in relative affect incidence (relative because 
each affective state was compared to other affective states). A 
power analysis that used Cohen’s (1992) guidelines for small (d =
0.2 sigma), medium (d = 0.5), and large (d = 0.8) effects; Hedges
and Pigott’s (2001) conservative heterogeneity estimates; and
Borenstein et al.’s (2009) formulas for power analysis of main
effects, assuming that each study had an N of 30, indicated that 22 studies
would have adequate power (power = 0.8) to detect small (d = 
0.2) or larger effects with a two-tailed test with an alpha level of 
0.05. This power analysis is somewhat conservative, because the 
estimated N of 30 is substantially lower than the mean number of 
students in the studies (mean N was 73, as discussed below). 
Repeating the power analysis with an N of 73 indicated that 11 
studies were needed to detect small or larger effects and the 24 
studies could detect effects as small as 0.14 sigma. 
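
The general shape of such a calculation can be sketched with a normal approximation for the mean effect in a random-effects model. The sketch rests on assumptions that are not confirmed by the text: a within-subject variance formula for d with an assumed correlation of 0.5, and a heterogeneity level expressed as a ratio of between-study to within-study variance. The function name `meta_power` and the parameter `tau2_ratio` are hypothetical, and the sketch is not guaranteed to reproduce the study counts reported above.

```python
from math import sqrt
from statistics import NormalDist


def meta_power(d: float, n_per_study: int, k_studies: int,
               tau2_ratio: float = 3.0, alpha: float = 0.05) -> float:
    """Approximate two-tailed power to detect a mean effect d across k studies."""
    norm = NormalDist()
    # Assumed within-study sampling variance of d (paired design, r = 0.5).
    v_within = 1.0 / n_per_study + d ** 2 / (2.0 * n_per_study)
    # Assumed between-study variance, expressed as a multiple of v_within.
    v_total = v_within * (1.0 + tau2_ratio)
    # Standard error of the mean effect when all k studies share v_total.
    se_mean = sqrt(v_total / k_studies)
    lam = d / se_mean
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the test statistic falls in either rejection region.
    return (1.0 - norm.cdf(z_crit - lam)) + norm.cdf(-z_crit - lam)
```

Calling `meta_power(0.2, 73, 11)` and then increasing `k_studies` shows how power grows with the number of studies; lowering `tau2_ratio` models milder heterogeneity.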


Description of Studies


Table 1 provides an overview of the methodologies of the 24 
studies that were analyzed. The subsequent discussion focuses on 
some of the key between-study differences. 

Student populations. Student samples exhibited diversity in 
terms of education level, age, ethnicity, country, and native lan- 
guage. The samples consisted of middle school, high school, 
college, and adult students from the United States, Canada, the 
United Kingdom, Philippines, and Australia. The number of stu- 
dents per study ranged from 7 to 260, with a median of 39, a mean 
of 73, and a standard deviation of 66. In all, affect data from 1,740 
students were analyzed. 

Training time. Training time refers to the amount of time 
students engaged with the learning technology. Training times 
ranged from 10 min to 90 min, with a median of 39 min, and a 
mean of 43 min (SD = 21). Data from most studies were collected 
in a single session, but some studies had multiple sessions. 

Learning contexts. The term learning context encapsulates
the learning setting, learning task, learning technology, and learn- 
ing topic. In terms of the learning setting, there was a mix of 
laboratory, school, and online studies. The school studies were 
usually conducted in computer laboratories rather than classrooms 
because technology was involved. Data from the three online 
studies with adults were collected with Amazon Mechanical Turk, 
which is a crowdsourcing platform that allows individuals to
receive monetary compensation for completing human intelligence 
tasks online (http://www.mturk.com). Studies in which the primary 
learning activity was linked to the students’ classroom curricula 
and studies conducted in computer labs in schools are grouped in 
the “Authentic” rather than the “Laboratory” context category in Table 1.

There was considerable diversity in terms of tasks, topics, and 
technologies. Learning tasks involved one-on-one tutoring ses- 
sions with ITSs, solving logic puzzles, interacting with simulations 
and serious games, developing reading comprehension, practicing 
for standardized tests, and developing writing proficiency. Topics 
(subject domains) covered included algebra, analytical reasoning, 
argumentative writing, chemistry, computer literacy (hardware, 
operating systems, the Internet), ecology, genetics, geography, 
graphing, logic puzzles, microbiology, pre-algebra, and social 
studies. The learning technologies included intelligent tutoring 
systems, serious games, simulation environments, virtual labs, and 
computer interfaces for problem solving, reading comprehension, 
and essay writing. Students individually worked with the learning 
technology in almost all of the cases (see Baker et al., 2011, for an 
exception). 

Methodologies to monitor affective states. Affective states 
were tracked with a number of methodologies including online 
self-reports, emote-aloud protocols, online observations, and ret- 
rospective coding of video after a learning session by the students 
themselves or by peers, observers, or trained judges. Online self- 
reports involved periodically polling the student for an affect 
report, whereas emote-aloud protocols asked students to make
spontaneous (i.e., nonprompted) verbal reports on their affective 
states as these were consciously experienced (Craig, D’Mello, 
Witherspoon, & Graesser, 2008). Online observations involved 
one or more coders making observations on student affect during 
a learning session (Rodrigo & Baker, 2011a). Cued-recall or 
retrospective affect judgment protocols involved collecting video 
recordings of a student’s face and computer screen (to capture
context) during the session and obtaining affect judgments over the 
course of replaying these videos after the session (Graesser, Chipman,
King, McDaniel, & D’Mello, 2007). The affect judgments
for studies using these offline retrospective judgments were usu- 
ally provided by the students themselves, but other studies used 
observers, such as trained judges, peers, or teachers (Graesser et 
al., 2006). 

Affect sampling rates varied as a function of methodology and 
study. Several studies used a fixed sampling rate, where affect 
measurements were collected at regular intervals ranging from
15 s to 7 min. Another option was voluntary measurements, in 
which affect reports were made as they occurred online or on the 
basis of offline video coding. Some studies even used a combina- 
tion of fixed and voluntary measurement, where affect measure- 
ments were collected at fixed intervals, but judges (or raters) were 
permitted to offer voluntary judgments between two fixed sam- 
pling points (e.g., Graesser et al., 2006). Some studies used an 
event-based sampling method, in which affect measurements were 
elicited to correspond to predetermined events (e.g., at the start of 
a new problem or topic, after completion of a problem or topic, or 
a few seconds after receiving feedback; e.g., Graesser et al., 2007). 
Table 1 lists sampling rates when available, which was usually for 
studies that used fixed sampling methods or fixed + voluntary 
sampling methods. It is impossible to compute precise sampling 
rates for event-based and other methods, so these have simply been 
listed as “Varied” in the table. 

Affective states considered in the studies. A total of 17 
affective states plus neutral were tracked in the 24 studies (see 
Table 2 for an alignment of affect by study). These include anger, 
anxiety, boredom, confusion, contempt, curiosity, delight, disgust, 
engagement/flow, eureka, excitement, fear, frustration, happiness, 
interest, sadness, and surprise. Twenty studies included a neutral 
category, four studies incorporated an “other” category, and one 
study included a “none” category. The number of affect states 
included in each study (including neutral, other, and none) ranged 
from 5 to 15 with a mean of 8 states (SD = 3) and a median of 7 
states. 

The definitions of most of these affective states are well known, 
but engagement/flow requires some clarification. Quite different 
from passively attending to a task, engagement/flow is conceptu- 
alized as a state of mild positive affect when involved with a task, 
such that concentration is intense, attention is focused, and focus is 
complete. However, it may not involve some of the aspects of 
Csikszentmihalyi’s (1990) conceptualization of flow that refer to 
extreme intensity to the extent that there is time distortion or loss 
of self-consciousness. 

It should also be noted that fear and anxiety are related but 
distinct constructs, as argued by Öhman (2008).


Analysis and Results 


Encoding and Computing Effect Size Statistics 


There was a difference in the set of affective states monitored in
each study, and some states were included in only a handful of
studies. A power analysis with power = 0.8, α = 0.05, and N of
73 (this is the mean N across the 24 studies) indicated that a
minimum of 4 studies was needed to detect at least a small- to
medium-sized effect (d = 0.35) under conditions of severe
heterogeneity. As such, states that were present in fewer than four
studies were excluded from the subsequent analyses. These
included excitement, interest, and eureka.

Table 1 (continued)

| No. | Task | Technology | Domain | Setting^a | Min^b | No. S^c | Level | Country | No. A^d | Method | Rate^e | Reference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18 | Narrative-centered learning environment | Crystal Island | Microbiology and genetics | Authentic | 55 | 260 | Middle | USA | 7 | Online self-report | 420 s | Sabourin et al. (2011) |
| 19 | Interaction with ITS | AutoTutor | Computer literacy | Laboratory | | 7 | College | USA | 8 | Emote-aloud | Varied | Craig et al. (2008) |
| 20 | Interaction with serious game | Incredible Machine | Logic puzzles | Authentic | | 36 | High | Philippines | 7 | Observational | 60 s | Rodrigo & Baker (2011a) |
| 21 | Interaction with ITS | Operation ARIES! | Research methods | Laboratory | | 31 | College | USA | 9 | Retrospective (self) | Varied | Lehman et al. (2011) |
| 22 | Reading comprehension | Web interface | Social studies | Laboratory | 25 | 138 | Adults | USA | | Online self-report | Varied | Strain & D’Mello (2011) |
| 23 | Virtual laboratory | ChemCollective: Virtual labs | Chemistry | Authentic | 45 | 25 | College | USA | 4 | Observational | 131 s | Baker et al. (2011) |
| 24 | Interaction with ITS | Cognitive Tutor | Algebra | Authentic | 90 | 89 | High | USA | 4 | Observational | 332 s | Baker et al. (2012) |

Note. ITS = intelligent tutoring system.
^a Denotes whether affect was tracked in an authentic setting, where students were monitored in classrooms (most studies), or in a laboratory setting, where they were monitored in a lab (including online studies).
^b Min = length of learning session in minutes.
^c No. S = number of students.
^d No. A = number of affective states.
^e Affect sampling rate when applicable in seconds, with Varied indicating no fixed rate.

The critical dependent variable for an affective state was the 
proportional occurrence of that state for a given student in a given 
study. Hence, the proportions for a single student
sum to 1. Some studies utilized multiple judges to assess
students’ affective states (Alhothali, 2011; D’Mello & Graesser, 
2011). This can violate independence assumptions, so proportional 
scores were averaged across the multiple judges prior to comput- 
ing the effect sizes. 
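
The dependent variable described above can be illustrated with a short sketch that turns one student's sequence of affect labels into proportions and then averages across judges. The function names `proportions` and `average_over_judges` and the example labels are hypothetical; the sketch assumes every label in the sequence belongs to the study's list of states.

```python
from collections import Counter
from typing import Dict, List, Sequence


def proportions(labels: Sequence[str], states: Sequence[str]) -> Dict[str, float]:
    """Proportional occurrence of each state in one student's label sequence."""
    if not labels:
        raise ValueError("at least one affect judgment is required")
    counts = Counter(labels)
    # Proportions sum to 1 as long as every label appears in `states`.
    return {state: counts[state] / len(labels) for state in states}


def average_over_judges(per_judge: List[Dict[str, float]]) -> Dict[str, float]:
    """Average one student's proportions across multiple judges."""
    states = per_judge[0].keys()
    return {s: sum(p[s] for p in per_judge) / len(per_judge) for s in states}
```

Averaging first means each student contributes a single set of proportions to the effect size computation, regardless of how many judges rated that student.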

Two sets of standardized effect size measures were computed 
for each affect state (target affective state) from these proportion 
scores. The first was an overall effect size, which was the stan- 
dardized mean difference (Cohen’s d) between the proportion of 
the target affective state compared to the average proportion of the 
other states. This would yield one standardized overall effect size 
for each affective state in a study. There would be 350 effect sizes 
(14 states) if every study used the same affect labels. However, this 
was never the case, and when summed across studies, there were 
159 overall effect sizes. 

The second set of effect sizes, called pairwise effect sizes, was 
the standardized mean difference (Cohen’s ds) when the propor- 
tion of each affective state was compared to the proportions of 
each other state in the study. Hence, a study with e affective states
would yield e × (e − 1)/2 pairwise effect sizes. Overall, there
were 596 pairwise effects across the 24 studies.
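
The pairwise count follows from enumerating every unordered pair of states within a study. A minimal sketch in Python is given here for illustration; the function name and the five-state list are hypothetical and are not taken from the analysis itself.

```python
from itertools import combinations


def pairwise_comparisons(states):
    """Return every unordered pair of affective states tracked in one study."""
    return list(combinations(states, 2))


# Hypothetical study that tracked five affective states
states = ["boredom", "confusion", "delight", "engagement/flow", "frustration"]
pairs = pairwise_comparisons(states)

e = len(states)
# The number of pairs equals e * (e - 1) / 2
assert len(pairs) == e * (e - 1) // 2
```

Summing this count over studies gives the total number of pairwise effects before outlier removal, provided each study contributes only the states it actually tracked.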

It is important to note that the two effect size measures provide 
similar information on the relative incidence of each affect state, 
albeit at different levels of granularity. The overall effect size 
provides a coarse-grained assessment of the relative incidence of a 
target affect state by comparing it to the average of the other 
affective states. The pairwise effect size provides a finer grained 
assessment because the target affective state is compared to each 
other affective state in the study. Neither measure is affected by
methodological differences in terms of affect measurement models
across studies, because proportional scores were compared only
within an individual study. That is, the proportions of
the affective states in a study (including neutral, none, and other)
always summed to 1, so between-study differences, such as
including a different list of affective states, would not affect the
results. Additionally, a given affective state could occur with the
same proportion in two studies but could have different effect sizes 
(both overall and pairwise), based on the other states included in 
the study. This is exactly what is needed because the metric of 
interest is “relative frequency,” which should vary as the affective 
states considered in each study vary, as opposed to an absolute 
measure, which should be the same across studies. Furthermore, a 
standardized (instead of a raw) effect size estimate was used in the 
analyses, which is widely recommended in the literature because it 
makes it feasible to aggregate effects across studies (Borenstein et 
al., 2009; Lipsey & Wilson, 2001). That being said, the analyses 
were repeated with an unstandardized effect size metric (the raw 
proportions for each state) and the same major patterns were 
discovered. 

A fully automated approach was adopted in order to minimize 
errors associated with computing and analyzing the effect sizes. 
First, the authors of the 24 studies were contacted with a request to 
Table 2
Alignment of Affective States by Individual Study

[The per-study columns (Studies 1 to 24, marked + where a state was tracked) and the "N states" row are not recoverable from the extracted text; only the "All studies" totals are reproduced.]

Affect             All studies
Anger              6
Anxiety            7
Boredom            24
Confusion          24
Contempt           4
Curiosity          9
Delight            14
Disgust            5
Engagement/flow    20
Eureka             3
Excitement         1
Fear               4
Frustration        23
Happiness          7
Interest           [not recoverable]
Sadness            4
Surprise           16
Neutral            20
Other              4
None               1

Note. Please see Table 1 for details on individual studies indexed by study number.
^a Study methodology permitted multiple affective states to be simultaneously reported.


provide student-level proportionalized affective state data. This
consisted of an n × e (student × affective state) matrix with cell
(i, j) representing the proportional occurrence of affective state j
for student i. A computer program was developed to compute the
within-subject standardized effect size and its variance with for- 
mulas specified in Borenstein et al. (2009). After accuracy was 
verified, the program was used to automatically compute the 159 
overall and 596 pairwise effect sizes along with their variances. 
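
The computation described above can be sketched as follows. The article states only that the formulas in Borenstein et al. (2009) were used; the sketch assumes these are the matched-pairs formulas, d = mean(diff) / S_within with S_within = S_diff / sqrt(2(1 − r)), and V_d = (1/n + d²/(2n)) × 2(1 − r). All function names and the small data matrix are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation; undefined if either input has zero variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def within_subject_d(target, comparison):
    """Within-subject standardized mean difference and its variance.

    Assumes the matched-pairs formulas of Borenstein et al. (2009).
    """
    n = len(target)
    diff = [t - c for t, c in zip(target, comparison)]
    r = pearson_r(target, comparison)
    s_within = stdev(diff) / sqrt(2.0 * (1.0 - r))
    d = mean(diff) / s_within
    var_d = (1.0 / n + d ** 2 / (2.0 * n)) * 2.0 * (1.0 - r)
    return d, var_d


def overall_effect(props, j):
    """Overall effect for state j: its proportion vs. the mean of the other states."""
    target = [row[j] for row in props]
    others = [mean([v for k, v in enumerate(row) if k != j]) for row in props]
    return within_subject_d(target, others)


# Hypothetical n x e matrix (4 students, 3 states); each row sums to 1
props = [
    [0.60, 0.30, 0.10],
    [0.50, 0.35, 0.15],
    [0.70, 0.20, 0.10],
    [0.55, 0.25, 0.20],
]
d, var_d = overall_effect(props, 0)
```

A pairwise effect would reuse `within_subject_d` with the proportion columns of two states as its inputs. The sketch does not guard against degenerate inputs, such as a state with identical proportions for every student.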

An examination of the effect size distributions revealed a few 
outliers that would potentially skew the results. Outliers were 
identified as ds that exceeded three standard deviations from the 
mean and were removed as recommended by Lipsey and Wilson 
(2001). A total of 3 overall effects (1.89%) and 26 (4.5%) pairwise 
effects were identified as outliers and removed. 
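
The outlier rule can be illustrated with a short Python sketch. It assumes the sample standard deviation and a symmetric cutoff around the mean, neither of which is specified in the text; the function name and the data are hypothetical.

```python
from statistics import mean, stdev


def remove_outliers(ds, k=3.0):
    """Keep effect sizes within k sample standard deviations of the mean."""
    m, s = mean(ds), stdev(ds)
    return [d for d in ds if abs(d - m) <= k * s]


# Hypothetical distribution: 20 small effects plus one extreme value
ds = [0.0, 0.2] * 10 + [10.0]
kept = remove_outliers(ds)
assert 10.0 not in kept
```

Note that with very few effects a single value cannot exceed three standard deviations from the mean, so a rule of this kind only has force when the distribution is reasonably large.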

The metafor package (Viechtbauer, 2010), which is a validated 
library for conducting meta-analyses within the R environment, 
was used for all subsequent analyses. Restricted maximum likeli- 
hood estimation (REML) was used for model fitting. Both overall 
and pairwise effects were analyzed and are reported here; however, 
the emphasis is on overall effects, because they provide a clearer
picture of the relative incidence of the affective states.


Random-Effects Models on Overall Effect Sizes 


A random-effects model was used to model the effect size 
distributions due to the considerable between-study variability in 
student populations, learning technologies, and methodologies. A 
random-effects model assumes that the true effect varies across
studies due to between-study variance (Tau²) and within-study
variability. Total variance for each study is the sum of between-study
(constant for all studies) and within-study variability.


The homogeneity statistic (Q; Cochran, 1954) was used to test 
whether the true effect sizes were consistent (homogeneity) or 
varied across studies (heterogeneity), with significant Qs being 
indicative of heterogeneity. With the exception of that for
contempt, Qs for all effects were significant (p < .05 with
two-tailed tests unless specified otherwise), thereby indicating
heterogeneity in effect size distributions (see Table 3). The table also lists
the I² statistic, which is the percent of variance that can be
attributed to true heterogeneity (Borenstein et al., 2009). I² for the
heterogeneous effect sizes (i.e., those with significant Qs)
ranged from 63.2% to 99.7%, which indicates that much of the 
variance was caused by real between-study differences instead of 
random error. These between-study differences are explored with 
moderation analyses in the next section. 
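
For readers who want to see how Q, I², and an inverse-variance weighted mean fit together, a minimal Python sketch follows. It substitutes the closed-form DerSimonian-Laird estimator of Tau² for the REML estimation used in the article, so its output would not match metafor exactly; the function name and input values are hypothetical.

```python
from math import sqrt


def random_effects_summary(ds, variances):
    """Q, I^2, Tau^2 (DerSimonian-Laird, not REML) and the weighted mean effect."""
    k = len(ds)
    if k < 2:
        raise ValueError("at least two studies are required")
    # Fixed-effect weights and mean are needed to compute Q
    w = [1.0 / v for v in variances]
    fixed_mean = sum(wi * di for wi, di in zip(w, ds)) / sum(w)
    q = sum(wi * (di - fixed_mean) ** 2 for wi, di in zip(w, ds))
    df = k - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    # Random-effects weights add Tau^2 to each within-study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    d_plus = sum(wi * di for wi, di in zip(w_star, ds)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return {"Q": q, "I2": i2, "tau2": tau2, "d_plus": d_plus, "se": se}


# Hypothetical effect sizes from three studies with equal sampling variance
summary = random_effects_summary([2.0, -1.0, 0.5], [0.05, 0.05, 0.05])
```

Because Tau² is added to every study's variance, large heterogeneity pushes the random-effects weights toward equality across studies, which is why small studies carry more relative weight than they would under a fixed-effect model.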

Weights were assigned to each study based on inverse variance
weighting, and the key summary measure was the weighted mean
effect size (d+). For affective state e, d+ indicates the weighted
mean (across studies) of the effect sizes of e, where each effect size
(d) is the standardized mean difference (Cohen’s d) between the
proportion of e and the average proportion of the other states in
each study. Descriptive statistics (in descending order of d+) along
with significance tests of the main effect (d+) for the 14 affective
states are presented in Table 3. The results supported a
four-category grouping of affective states in terms of the magnitude and
significance of the weighted mean effect sizes. Engagement/flow
was the only state that yielded a significant and positive d+ of
2.46. Weighted mean effects for boredom (d+ = 0.19) and
confusion (d+ = 0.12) were consistent with small nonsignificant positive
effects. Curiosity (d+ = −0.10) and happiness (d+ = −0.13), on
the other hand, yielded small nonsignificant negative effects. The
Table 3
Descriptives of Effect Sizes and Heterogeneity Analysis Based on Random Effects Model

                         Descriptive statistics of d+                          Range of d                             Heterogeneity
Affective state    k^a   M (SE)          95% CI [LL, UL]   Z       p       d < −0.2   −0.2 ≤ d ≤ 0.2   d > 0.2     Q       Tau²    I²
Significant and positive
  Engagement/flow  19    2.5 (0.46)      [1.6, 3.4]        5.37    <.01    0.00       0.00             1.0         213.4   3.6     97.4
Not significant and positive
  Boredom          21    0.19 (0.27)     [−0.34, 0.71]     0.69    .49     0.33       0.05             0.62        259.4   1.3     95.4
  Confusion        20    0.12 (0.28)     [−0.43, 0.66]     0.42    .67     0.40       0.15             0.45        228.1   1.4     96.3
Not significant and negative
  Curiosity        9     −0.10 (0.32)    n/r               n/r     .74     0.44       0.22             0.33        n/r     0.7     n/r
  Happiness        7     −0.13 (0.22)    [−0.55, 0.30]     −0.60   .55     0.43       0.29             0.29        36.0    0.3     86.4
Significant and negative
  Contempt         4     −0.78 (0.14)    n/r               −5.47   <.01    1.0        0.00             0.00        n/r     0       0
  Anger            6     −1.2 (0.34)     n/r               n/r     n/r     1.0        0.00             0.00        28.7    0.4     80.6
  Disgust          3     −1.5 (0.44)     n/r               −3.35   <.01    1.0        0.00             0.00        17.0    0.7     89.2
  Sadness          4     n/r             n/r               n/r     <.01    1.0        0.00             0.00        n/r     0.1     63.2
  Anxiety          n/r   −1.5 (0.68)     n/r               −2.23   .03     0.71       0.29             0.00        138.5   3.1     98.1
  Delight          12    n/r             [−2.8, −1.4]      n/r     n/r     1.0        0.00             0.00        88.8    1.5     90.8
  Frustration      21    n/r             [−4.6, −0.38]     n/r     n/r     n/r        0.19             0.29        378.9   23      99.7
  Fear             4     −2.7 (0.64)     n/r               −4.29   <.01    1.0        0.00             0.00        n/r     1.4     89.7
  Surprise         14    n/r             n/r               n/r     n/r     0.93       0.00             0.07        330.8   52.3    99.4

Note. SE = standard error; CI = confidence interval; LL = lower limit; UL = upper limit; n/r = value not recoverable from the extracted text.
^a k is the number of studies.


remaining nine affective states (anger, disgust, sadness, anxiety,
contempt, delight, frustration, fear, and surprise) had significantly
negative d+s ranging from −6.5 to −0.78.

Although d+ provides a useful summary measure of the overall
relative incidence of each affective state across all the studies, it is
somewhat insensitive to the nuances in the individual effect size
distributions. This was investigated by categorizing the individual
effects on the basis of Cohen’s (1992) proposed convention of 0.2,
0.5, and 0.8 sigma representing small, medium, and large effects,
respectively. In particular, each effect was grouped into one of
the following three categories: (a) small or larger negative
effect (d < −0.2), (b) negligible effect (−0.2 ≤ d ≤ 0.2), and
(c) small or larger positive effect (d > 0.2). The proportion of
studies falling into each of the three categories is presented in 
Table 3. This analysis indicated that the effect sizes of engage- 
ment/flow were always greater than 0.2, and the effect sizes of 
contempt, anger, disgust, sadness, delight, and fear were con- 
sistently less than —0.2. With minor exceptions, the effect sizes 
for surprise and anxiety were less than —0.2. The results for 
these nine states were consistent with the patterns of the 
weighted mean effect sizes. 
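
The three-way categorization can be expressed compactly. The following Python sketch uses the ±0.2 cutoff described above; the function names and sample effect sizes are hypothetical.

```python
def categorize_effect(d, threshold=0.2):
    """Assign an effect size to one of three categories using a +/-0.2 cutoff."""
    if d < -threshold:
        return "negative"
    if d > threshold:
        return "positive"
    return "negligible"


def category_proportions(ds, threshold=0.2):
    """Proportion of effects that fall into each category."""
    counts = {"negative": 0, "negligible": 0, "positive": 0}
    for d in ds:
        counts[categorize_effect(d, threshold)] += 1
    return {name: count / len(ds) for name, count in counts.items()}


# Hypothetical effect sizes for one affective state across four studies
proportions = category_proportions([-0.5, 0.0, 0.3, 0.9])
```

A state whose weighted mean is near zero can still show a split between the negative and positive categories, which is the pattern that a single summary value hides.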

The data were more interesting for boredom, confusion,
curiosity, happiness, and frustration. Boredom and confusion were two
affective states with positive nonsignificant d+s, yet ds for these
states alternated between small or larger negative (d < −0.2) and
positive (d > 0.2) effects. Effect sizes for curiosity and happiness,
which were two states with nonsignificant negative d+s, alternated
among all three categories (d < −0.2; −0.2 ≤ d ≤ 0.2; d > 0.2).
The data for frustration were somewhat surprising, as the d+ for
this state was significant and negative, but ds > 0.2 were observed
for approximately one third of the studies. Taken together, it
appears that the low d+s for boredom, confusion, curiosity,
happiness, and frustration should not be attributed to consistently low
ds across studies but rather to considerable between-study
variability in ds. This is different from the other nine states that
consistently yielded positive effects (flow/engagement) or
negative effects (contempt, anger, disgust, sadness, delight, fear,
anxiety, and surprise).


^b All Qs significant at p < .001 except for contempt, where p = .048.


Mixed-Effects Models on Overall Effect Sizes 


Mixed-effects models assume that a portion of between-study 
variability can be modeled by considering systematic differences 
between studies (i.e., moderators). The four study-level modera- 
tors that were considered were self-report, authentic context, ALT 
used, and training time (see Table 1 for details on these variables). 
Self-report was a dichotomous variable that indicated whether the 
affect measurements were provided by the students themselves 
(coded as 1) or by an observer (coded as 0), such as a trained 
judge, a researcher, or an untrained peer. Authentic context was 
also a dichotomous variable; it was coded as 1 if the learning 
context occurred in a school setting and as 0 if the context occurred 
in laboratory studies. ALT was dichotomous and represented 
whether the learning environment was an advanced learning tech- 
nology, such as an ITS or a serious game (coded as 1), or a simpler 
computer interface (coded as 0). Finally, training time was a 
continuous variable that represented the length of the training 
session in minutes. 

Additional study-level variables from Table 1 (i.e., learning 
task, learning technology, task domain, population, and country of 
students) were not considered because these categorical variables 
had too many levels, thereby making it difficult to derive
meaningful models given the constraints of the data. Student population
was not included as an independent variable because it was highly 
correlated with authentic context. That is, college students were 
the primary participants in laboratory studies, but middle and high 
school students made up the samples of studies in more authentic 
learning contexts (i.e., in schools). 

Tolerance values for the four moderator variables ranged from 
.361 to .451. This suggests that there were no severe multicol- 
linearity problems, because tolerances exceeded or were very close 
to the recommended value of 0.4 (Allison, 1999). 

Mixed-effects models were constructed only for the relatively 
more frequent affective states: engagement/flow, boredom,
confusion, curiosity, and happiness. Though relatively less frequent,
frustration was also included in the analyses due to the interesting
between-study variability in effect size distributions for this state 
(see previous section). Separate models were constructed for each 
moderator in order to individually assess its predictive power. This 
resulted in 24 models (6 affective states X 4 moderators), as shown 
in Table 4. 
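
A single-moderator model of this kind can be sketched as a weighted regression of effect sizes on the moderator. The Python sketch below treats the residual Tau² as a known input, whereas the article estimated it with REML, and it returns an unstandardized slope. The function name, argument names, and data are hypothetical.

```python
from math import sqrt


def meta_regression_one_moderator(ds, variances, x, tau2):
    """Weighted least squares fit of effect size on one study-level moderator.

    tau2 is the residual between-study variance, assumed to have been
    estimated elsewhere; it is only an input to this sketch.
    """
    w = [1.0 / (v + tau2) for v in variances]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    d_bar = sum(wi * di for wi, di in zip(w, ds)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (di - d_bar) for wi, xi, di in zip(w, x, ds))
    slope = sxy / sxx
    intercept = d_bar - slope * x_bar
    se_slope = sqrt(1.0 / sxx)
    z = slope / se_slope
    # With one moderator, the Wald-type test of moderators equals Z squared
    return {"B": slope, "intercept": intercept, "SE": se_slope, "Z": z, "QM": z ** 2}


# Hypothetical data: four studies, moderator coded 1 for self-report, 0 for observers
fit = meta_regression_one_moderator(
    ds=[1.0, 1.0, 3.0, 3.0],
    variances=[0.1, 0.1, 0.1, 0.1],
    x=[0, 0, 1, 1],
    tau2=0.1,
)
```

For a dichotomous moderator the slope is simply the difference between the weighted mean effects of the two groups of studies, which makes the coefficient straightforward to interpret.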

Significant models were discovered for engagement/flow, bore- 
dom, confusion, and frustration but not for curiosity and happiness, 
presumably due to the small sample size for these two states. An 
examination of the model coefficients indicated that students were
less likely to self-report being in the engaged/flow state, but they
were more likely to report being bored and frustrated. This sug- 
gests that the source of the affect judge (self vs. observers) can 
impact what is being reported. It might be the case that observers 
have difficulty in detecting boredom and frustration due to subtle 
and conflicting cues associated with natural displays of these 
affective states. An analysis of facial expressions of boredom and 
engagement/flow has indicated that, compared to states like con- 
fusion and delight, these affective states have more subtle facial 
markers (McDaniel et al., 2007). This would make it difficult for 
observers to detect these states. Frustration provides an interesting 
case study for emotion perception because natural frustration is 
sometimes expressed with brief half smiles (Hoque & Picard, 
2011), which are often confused with emotions like delight (Mc- 
Daniel et al., 2007). 

Studies conducted in authentic learning contexts were associ- 
ated with more engagement/flow and less boredom and frustration. 
This is an expected finding, because motivation and engagement 
are likely to be higher when the learning task is aligned with 
students’ schoolwork to the extent that students perceive intrinsic 
value in the learning activities. In contrast, boredom was higher in 


Table 4
Model Summaries and Standardized Coefficients of Mixed-Effects Models

                                       Coefficients                                    Heterogeneity
Affective state and moderator   k      B (SE)            95% CI [LL, UL]     Z        QE^c     QM^d     Tau²
Engagement/flow
  Self-report                   19     −3.07 (0.58)      n/r                 n/r      97.1     n/r      1.16
  Authentic context             19     2.87 (0.68)       [1.54, 4.20]        4.24     196.0    17.94    n/r
  ALT used                      19     1.82 (0.87)       n/r                 2.10     197.7    4.40     3.05
  Min                           19     0.00 (0.03)       [−0.06, 0.05]       n/r      213.4    0.01     3.92
Boredom
  Self-report                   21     1.69 (0.41)       [0.88, 2.49]        4.11     179.5    16.88    0.69
  Authentic context             21     −1.79 (0.38)      n/r                 −4.69    103.3    21.99    0.56
  ALT used                      21     −1.15 (0.47)      [−2.07, −0.24]      −2.47    142.0    6.08     0.98
  Min                           21     0.01 (0.02)       [−0.02, 0.04]       0.49     226.7    0.24     1.42
Confusion
  Self-report                   20     −0.02 (0.61)      n/r                 −0.03    n/r      0.00     1.48
  Authentic context             20     0.18 (0.62)       n/r                 0.29     207.5    0.09     1.47
  ALT used                      20     0.50 (0.56)       [−0.60, 1.59]       0.89     195.6    0.78     1.39
  Min                           20     0.03 (0.01)^a     [0.00, 0.05]        1.84     161.3    3.38     1.19
Curiosity
  Self-report                   —      —
  Authentic context             9      0.79 (0.95)       [−1.07, 2.66]       0.83     58.5     0.69     0.77
  ALT used                      9      0.83 (0.61)       [−0.38, 2.03]       1.35     53.3     1.81     0.63
  Min                           9      0.00 (0.02)       [−0.05, 0.05]       −0.03    72.4     0.00     0.86
Happiness
  Self-report                   7      −0.39 (0.88)      [−2.13, 1.34]       −0.45    35.9     0.20     0.29
  Authentic context             —      —
  ALT used                      —      —
  Min                           7      −0.03 (0.03)      [−0.08, 0.03]       −0.98    26.6     0.96     0.26
Frustration
  Self-report                   21     7.04 (1.58)       [3.94, 10.15]       4.45     196.0    n/r      n/r
  Authentic context             21     −6.70 (1.71)      [−10.06, −3.34]     n/r      362.2    15.31    13.05
  ALT used                      21     −3.91 (2.10)^b    [−8.03, 0.21]       n/r      367.1    3.47     n/r
  Min                           21     −0.04 (0.06)      n/r                 n/r      377.9    0.54     24.32

Note. Blank cells (—) for coefficients indicate that moderator was not included in the model due to data availability. Significant coefficients (at p < .05) are bolded in the original table. SE = standard error; CI = confidence interval; LL = lower limit; UL = upper limit; ALT = advanced learning technology; Min = length of learning session in minutes; n/r = value not recoverable from the extracted text.
^a p = .07. ^b p = .06.
^c QE = test for residual heterogeneity and was always significant at p < .01.
^d QM = test of moderators (significance mirrors significance of coefficients).
laboratory studies, presumably because students perceive little
value in learning activities that are unrelated to their educational 
goals. Indeed, perceived value and goal congruence are considered 
to be key predictors of engagement and curiosity (Hidi, 2006; 
Pekrun et al., 2010). 

The use of ALTs was also associated with lower rates of 
boredom and frustration coupled with an increase in engagement/ 
flow. This might be attributed to the interactive nature of the 
technologies; their customized instruction via sophisticated student 
modeling; the presence of immediate, direct, and discriminating 
feedback; or to simple novelty effects. Finally, training time was a 
significant positive predictor of confusion, but it did not signifi- 
cantly predict any of the other states. 

Percent reduction in heterogeneity for each moderator was computed
as 100 × [(Tau²_without − Tau²_with)/Tau²_without], where the
subscripts denote the models without and with the moderator. Tau² with and
without moderators were taken from Tables 4 and 3, respectively.
The source of the affect judge (self vs. observers), the authenticity 
of the learning context, and the use of ALTs yielded heterogeneity 
reductions of 57%, 52%, and 18% averaged across flow/engage- 
ment, boredom, and frustration. Training time reduced heteroge- 
neity of confusion by 14%. 
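
As a worked illustration of the percent reduction, the Python sketch below applies the formula to the engagement/flow values as they appear in the tables: Tau² = 3.6 without moderators (Table 3) and Tau² = 1.16 with self-report as the moderator (Table 4). The function name is hypothetical, and because the tabled values are rounded, the result is only approximate.

```python
def percent_reduction(tau2_without, tau2_with):
    """Percent of between-study variance accounted for by a moderator."""
    return 100.0 * (tau2_without - tau2_with) / tau2_without


# Engagement/flow: Tau^2 = 3.6 without moderators, 1.16 with self-report
reduction = percent_reduction(3.6, 1.16)  # roughly 68%
```

The averages reported in the text pool this quantity across engagement/flow, boredom, and frustration, so an individual state can lie above or below the pooled figure.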


Random-Effects Models on Pairwise Effect Sizes 


The pairwise effect size distributions were also analyzed with 
random-effects models to complement the previous analyses on 
the overall effect sizes. Analyses were performed only for effect 
size distributions with at least 4 data points, in light of the power 
analysis discussed above. Standardized weighted mean pairwise 
effect sizes are shown in Table 5. 

The results largely replicated the patterns with overall effect 
sizes. Engagement/flow was relatively more frequent than all the 
other states. Boredom was relatively less frequent than engage- 
ment/flow, not significantly different from confusion, and rela- 
tively more frequent than the other affective states. Confusion was 
relatively less frequent than engagement/flow; statistically equiv- 
alent to boredom, curiosity, and happiness; and relatively more 
frequent than the remaining states. The results were particularly 


Table 5 
Weighted Mean Effect Sizes (d,.) for Pairwise Effects 


Affect | Ang Anx Bor Con Cmt Cur 





-—0.33 —0.94 


—0.91 


—0.82 
—0.54 
0.12 


=0!20) =0:34 
—0.68 


0.69 


Anger 
Anxiety 
Boredom 
Confusion 
Contempt 
Curiosity 
Delight 
Disgust 
Fear 
Engagement/flow 
Frustration 
Happiness 
Sadness 
Surprise 


0.99 
0.72 : 
—0.14 
interesting for frustration. Though relatively less frequent than
flow/engagement, boredom, and confusion, frustration was not 
significantly different than curiosity, happiness, and delight and 
was relatively more frequent than anger, anxiety, contempt, dis- 
gust, sadness, and surprise. 

The average pairwise d+ for all comparisons involving
engagement/flow, boredom, confusion, curiosity, happiness, and
frustration was 0.53 sigma, whereas anxiety, anger, disgust, contempt,
sadness, delight, fear, and surprise yielded an average pairwise d+
of −0.43 sigma. In summary, both the present results of the
pairwise effect sizes and the previous results with overall effect 
sizes suggest that engagement/flow, boredom, confusion, curios- 
ity, happiness, and frustration are the affective states that are more 
likely to occur during learning with technology, at least to the 
extent of the 24 studies included in this analysis. 


General Discussion 


The present paper focuses on identifying the moment-to-
moment affective states that students experience during learning
with technology. Considerable variability in the distribution of 
affective states was expected due to differences in student popu- 
lations, learning technologies, learning contexts, and methodolo- 
gies used for affect monitoring. The pertinent question was 
whether a set of generalizable learning-centered affective states 
that transcend inherent differences among students, technolo- 
gies, tasks, and methodologies could be identified. This ques- 
tion was addressed via a selective meta-analysis of 24 studies 
that systematically monitored the affective states of diverse 
samples of 1,740 students over the course of interaction with a 
variety of learning technologies. In this section, I discuss the 
major findings, align these findings with existing and emerging 
theoretical perspectives on affect and learning, consider impli- 
cations of the findings for the design of ALTs, and discuss 
limitations and potential areas of research that are particularly 
promising for future work. 


Affect 2 
Del Dis Fear Eng Fru Hap Sad Sur 
— 0.05 0.36 — —0.80 —0.49 0.16 0.21 
OT O:S8 0:60 — 124 — 0:50 0107 0.49 0.22 
0.88 1.03 1.10 -—1.49 0.48 0.43 0.98 1.08 
101 066 0.90 -—1.76 0.29 0.07 0.72 1.38 
— 0.21 — — —0.83 — — a 
0.59 0.35 = —0.87 0.00 — — 0.93 
—— — Ooo — — 0.55 
0.45 — =0.82 ~—0.57 0.12 0.32 
— -1.08 -0.80 -0.24 —0.15 
2.21 0.96 — 2.33 
0.18 0.86 0.78 
0.62 0.72 
0.13 





Note. Significant d+s are bolded. Positive values indicate that Affect 1 is significantly greater than Affect 2 (reverse for negative values). For example,
the Anger-Anxiety d+ of −0.33 indicates that anger was less frequent than anxiety. Blank cells (—) indicate that there were insufficient data for analysis.
Overview of Major Findings


A number of conclusions can be drawn from the results of this 
analysis. The primary finding is that engagement/flow, boredom, 
confusion, curiosity, happiness, and frustration appear to be the set 
of discrete affective states that are relatively more frequent during 
learning with technology. The fact that engagement/flow was
relatively frequent and that its relative frequency increased with
ALTs compared to less sophisticated learning environments
suggests that the enhanced interactivity, personalized instruction,
rapid feedback, and other features of ALTs have the intended
effect of engaging students. Unfortunately, this result is somewhat
tempered, as boredom was also quite frequent in several
of the studies, and more so with less interactive technologies.

It was also informative to discover that confusion was relatively 
frequent in several studies. Confusion, which is sometimes re- 
ferred to as a knowledge emotion (Silvia, 2009) or an epistemic 
emotion (Pekrun, 2010), occurs when students experience im- 
passes when processing new information that clashes with prior 
knowledge and exposes problematic misconceptions and errone- 
ous mental models (D’Mello & Graesser, 2012; Graesser, Lu, et 
al., 2005; Piaget, 1952). Learning is expected to be positively 
impacted to the extent that students reason and problem solve to 
resolve impasses (VanLehn et al., 2003) and undergo a form of 
conceptual change by revising their mental models (Dole & Sina- 
tra, 1998). 

In addition to these three states (engagement/flow, confusion, 
and boredom), curiosity, happiness, and frustration were observed, 
albeit with lower relative frequency. Curiosity is expected to be 
prominent when interest and motivation in the learning activity are 
high (Hidi, 2006), when there is novelty (Berlyne, 1978; Silvia, 
2009), and when students have some degree of choice over the 
specifics of the learning task (Cordova & Lepper, 1996). The 
learning contexts analyzed in this paper differed in the extent to 
which they supported these antecedents of curiosity, which is one 
hypothesis to explain the relatively lower occurrence of this state. 

Frustration is a state that occurs when there is negative feed- 
back, when important goals are blocked, when there is persistent 
failure, and when students are stuck and do not have an immediate 
plan to proceed (Burleson & Picard, 2004; D’Mello & Graesser, 
2012; Stein & Levine, 1991). Frustration-inducing events occur 
quite frequently over the course of learning conceptually difficult 
topics or when students attempt to solve challenging problems. 
It might be the case that the relative incidence of frustration was 
not exceedingly high in the studies that were analyzed, because 
most ALTs do not let students perseverate when they are stuck but 
offer hints, worked examples, and other opportunities to move the 
session forward. Indeed, the mixed-effects models revealed that 
frustration was lower in studies that used ALTs compared to 
simpler interfaces. 

In contrast to frustration, happiness is expected to occur when 
there is positive feedback on a student action, when students get 
insights to resolve troublesome impasses, and when intermediate 
goals are attained (Stein & Levine, 1991). It is difficult to imagine 
students sustaining states of happiness during learning with tech- 
nologies that actively advance the session by introducing new 
content and providing new problems to solve. This is because the 
new material can trigger an entirely different profile of affective 
states, thereby giving students little time to revel in their
happiness. Hence, the overall lower relative incidence of happiness is
consistent with expectations. 

The discussion so far has focused on the six affective states that 
were found to occur with notable frequency. However, negative 
evidence can also be quite compelling, and it is particularly infor- 
mative that eight of the affective states analyzed were found to be 
relatively infrequent. These include contempt, anger, disgust, de- 
light, anxiety, sadness, surprise, and fear. At first blush, the rela- 
tively low frequency of anxiety might be somewhat surprising, but 
it is important to note that the consequences of failure in the 
learning tasks were not severe. Anxiety is likely to be heightened 
during high-stakes learning tasks, such as preparing for an impor- 
tant exam or taking a standardized test (Zeidner, 2007). That being 
said, it is important to note that the levels of urgency of the tasks 
examined in this analysis are comparable to many real-world 
learning tasks (e.g., solving homework problems, writing a book 
report, reading the textbook), so the findings are applicable to tasks 
with low to moderate urgency. 

Aside from anxiety and delight, the remaining six states that 
were relatively infrequent can be considered to be basic emotions 
(Ekman, 1992; Izard, 2007). This suggests that, with the exception 
of happiness (a basic emotion), the basic emotions might be less 
relevant in short, in-depth learning sessions with technology, at 
least when it comes to the 24 studies analyzed in this paper. This 
finding is intuitively plausible, because there is no adequate jus- 
tification to expect the average student to experience persistent 
episodes of sadness and disgust when interacting with a reasonably 
well-engineered learning technology on a task of low to moderate 
urgency. The basic emotions might be more relevant during longer 
sessions (e.g., completing a dissertation), when stakes are high 
(e.g., studying for an important exam), or when there are interac- 
tions with peers and superiors, but this is entirely an empirical 
question. As it currently stands, it is a set of nonbasic affective 
states (and happiness) that play an active role when students 
complete short but focused learning tasks with technology. 

Finally, the analysis identified four between-study moderators 
that were predictive of student affect. Methodological and contex- 
tual factors, composed of whether the affect labels were provided 
by the students themselves or by external observers and whether 
the study was conducted in a more authentic classroom setting
versus a lab, were generally more predictive than the use of ALTs 
and training time. The main finding was that engagement/flow was 
more frequent when an ALT was used, when the study was 
conducted in an authentic learning context, and when the affect 
judgments were provided by observers rather than the students 
themselves. In contrast, boredom and frustration were predomi- 
nantly observed in laboratory studies, when simple computer in- 
terfaces were used in lieu of ALTs, and when affective states were 
measured via self-reports. Taken together, these findings highlight 
how study-level factors influence student affect. An important 
message is that the affective state distributions are mainly influ- 
enced by the activity, location, and source of the measurements. In 
particular, affective states are impacted by what activity the stu- 
dent is engaged in (ALT vs. simpler computer interface), where the 
measurement occurs (schools vs. laboratories), and who is
performing the measurement (self vs. others). Indeed, affective states are
highly situation dependent and contextually coupled, instead of 
being dispositional and context free. 
Theoretical Implications


The finding that engagement/flow, boredom, confusion, curios- 
ity, happiness, and frustration were the predominant affective 
states can be aligned with theories that specify how these states 
might arise from (a) appraisals of control and value of the learning 
activity (Pekrun, 2010; Pekrun et al., 2010), (b) appraisals of goal 
congruence and plan availability (Stein & Levine, 1991), and (c) 
impasse detection, impasse-resolution processes, and states of cog- 
nitive disequilibrium (Graesser, Lu, et al., 2005; Piaget, 1952; 
VanLehn et al., 2003). With the exception of the control-value 
theory, which attempts to be somewhat comprehensive, the other 
theories focus on only one or two affective states. Hence, a brief 
sketch of how the present findings can be aligned with an integra- 
tive account of these theories is provided below. 

The control-value theory emphasizes how appraisals of the 
perceived value in and control over the learning activity predict the 
affective states that students experience (Pekrun, 2010; Pekrun et 
al., 2002). Students perform these appraisals at multiple instances 
during a learning session, and continual appraisals with respect to 
challenges and progress can trigger major changes in student 
affect. Engagement/flow is expected to be heightened when stu- 
dents see value in the learning activity and when there is an 
appropriate balance between skill and challenges, so that they have 
some control over the outcome of the activity (Pekrun et al., 2010). 
Like engagement, curiosity is expected to be increased when 
intrinsic motivation in the learning task is high, which in turn 
influences appraisals of value (Berlyne, 1978; Cordova & Lepper, 
1996). In contrast, boredom occurs when value is low, when skills 
exceed challenges (too high control; Csikszentmihalyi, 1990), and 
when challenges exceed skills (too low control; Acee et al., 2010; 
Pekrun et al., 2010). 

On the basis of appraisals of control and value, students are 
typically in a state of either (a) engagement/flow, when they pursue 
the superordinate goal of mastering the material in the learning 
environment, or (b) disengagement (boredom), when they abandon 
pursuit of the superordinate learning goal. According to goal 
appraisal theories (Mandler, 1999; Stein & Levine, 1991), events 
that arise during the learning activity are constantly being ap- 
praised with respect to their congruence with the superordinate 
goal. Appraisals of these events, particularly with respect to goals, 
give rise to affective states. The arousal level (intense/weak) of the 
affective state is dependent upon how relevant the event is to the 
goal, whereas the valence (positive/negative) depends on whether 
the event is congruent or incongruent with respect to the goal 
(Mandler, 1984; Stein & Levine, 1991). Events that are consistent 
with the achievement of goals result in positive affective states, 
such as happiness, whereas outcomes that jeopardize goal achieve- 
ment can result in negative affective states, such as frustration. 

Confusion and frustration are states of considerable interest 
because they trigger subgoals associated with resolving goal- 
blocking events. Theories that emphasize the importance of inter- 
ruptions, impasses, and cognitive disequilibrium to learning 
(Graesser, Lu, et al., 2005; Mandler, 1984, 1999; Siegler & Jen- 
kins, 1989; VanLehn et al., 2003) posit that the students get 
confused when they are confronted with troublesome impasses and 
when they are uncertain about what to do next. The student can 
initiate a subgoal of effortful reasoning and problem solving to 
resolve the impasse and restore equilibrium. Equilibrium is restored
when the source of the discrepant information is discovered,
the impasse is resolved, and the student reverts back into the state 
of engagement/flow (D’Mello & Graesser, 2012). Frustration oc- 
curs when the impasse cannot be resolved, when the student gets 
stuck, or when there is no available plan to resolve the goal- 
blocking event (Stein & Levine, 1991). 


Implications for ALTs 


The results have important implications for ALTs that do not 
currently model affective states as well as for next-generation 
systems that aim to be sensitive to students’ affective states in 
addition to their cognitive states. On one hand, the fact that 
engagement/flow was the most (relatively) frequent affective state 
is promising for current learning technologies, especially for those 
that aspire to keep students focused; for example, by engaging in 
adaptive dialogue moves similar to human tutors (Graesser, Chip- 
man, et al., 2005) or by incorporating narratives, seductive details, 
and increasing interactivity, as is the case with simulation and 
game-based environments (Sabourin et al., 2011). On the other 
hand, boredom was relatively frequent and curiosity was relatively 
infrequent (compared to boredom); this implies that there is still 
considerable room for improvement, especially if the goal is to 
increase motivation, interest, and engagement. This is an important 
goal, because active engagement is a prerequisite to the deploy- 
ment of key cognitive and metacognitive processes, such as ef- 
fortful problem solving, self-explanation, prior knowledge activa- 
tion, planning, and inference generation. Hence, more basic 
research is needed to identify the factors that influence engage- 
ment and task persistence. Insights gleaned from such a research 
program can be integrated into ALTs so that they can increase 
curiosity and sustain long-term engagement over multiple sessions. 

In a related but somewhat different vein, the present results 
suggest that the student models of next-generation ALTs should 
incorporate affective states in addition to cognitive states and 
knowledge levels. It might be possible to design educational tech- 
nologies that increase positive affective states such as engagement, 
curiosity, and happiness, but it is highly unlikely that such systems 
will prevent the occurrence of boredom, frustration, and other 
negative affective states. These negative states can have crippling 
effects on task persistence, motivation, and learning gains (Craig et 
al., 2004; D’Mello & Graesser, 2011; Linnenbrink & Pintrich, 
2002; Pekrun et al., 2010), so next-generation ALTs should incor- 
porate mechanisms to intelligently handle the inevitable occur- 
rence of negative affective states in a manner that is contextually 
constrained and dynamically adaptive to individual students. Al- 
though the research community is beginning to make considerable 
progress along this front (Arroyo et al., 2009; Conati & Maclaren, 
2009; D’Mello et al., 2010; du Boulay et al., 2010; Forbes-Riley & 
Litman, 2009; Sabourin et al., 2011; Sazzad, AlZoubi, Calvo, & 
D’Mello, 2011), there is significant uncertainty associated with 
identifying which affective states to target, developing systems to 
detect these states, and implementing strategies to respond to the 
sensed states. The present analysis suggests that it might be ad- 
visable to initially focus on boredom, confusion, and frustration, 
because these were the relatively more frequent negative affective 
states across several studies and learning technologies. 

The first step in developing an affect-sensitive ALT to respond 
to boredom, confusion, and frustration requires automated methods
to detect affect. Previous work has indicated that it is possible
to automatically classify these states by analyzing contextual in- 
formation (e.g., whether negative vs. positive feedback was deliv- 
ered, difficulty of the current problem) and interaction features 
(e.g., time taken for students to respond to a question, response 
verbosity; Baker et al., 2012; D’Mello, Craig, Witherspoon, Mc- 
Daniel, & Graesser, 2008), along with behavioral and physiolog- 
ical markers that have been extensively studied by researchers in 
the fields of nonverbal behavior and affective computing (see these 
reviews: Calvo & D’Mello, 2010; D’Mello & Kory, 2012; Pantic 
& Rothkrantz, 2003; Valstar, Mehu, Jiang, & Pantic, 2012; Zeng 
et al., 2009). In addition to detecting affect, understanding why a 
particular affective state was triggered is critical in order to facil- 
itate more nuanced interventions to regulate affect. For example, 
there appears to be a curvilinear relationship between perceived 
control and boredom, in that boredom is frequent when control is 
too high (i.e., skill outweighs challenges; Csikszentmihalyi, 1990) 
but also when control is too low (challenges outweigh skill; Pekrun 
et al., 2010). A possible strategy to respond to boredom stemming 
from appraisals of low control is to decrease problem difficulty, 
but it would be beneficial to ramp up difficulty when boredom 
emerges from appraisals of high control. As this example illus- 
trates, interventions that are not informed by the potential causes 
that underlie an affective state are less likely to be effective and 
might even be harmful if the intervention is misaligned with the 
cause of an affective state. In other words, the affective state might 
be a symptom of an underlying cause, be it a lack of motivation or 
a lack of knowledge, so it is important to understand the cause in 
order to effectively treat the symptom. Understanding the causal 
structure that gives rise to particular affective states and distinct 
manifestations of a single state (e.g., different forms of boredom) 
requires fine-grained monitoring of the sequence of system- and 
learner-generated events immediately prior to an affective episode. 
This information not only is important to engineer effective affect- 
sensitive interventions but also contributes to the basic research on 
understanding affect during learning. It is impossible to obtain this 
level of fine-grained information at a large scale in more tradi- 
tional learning settings (e.g., a classroom) or in technologically 
light settings. However, ALTs provide an excellent forum to 
investigate affective experience in fine-grained detail, because 
they are specifically designed to be dynamically sensitive to indi- 
vidual students at fine-grained levels (i.e., microadaptivity or the 
inner loop à la VanLehn, 2006), and they usually keep meticulous
logs of system- and learner-generated events that can be automat- 
ically mined. 
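
The boredom example above can be sketched as a simple decision rule. This is an illustrative sketch only, not a system described in the studies reviewed: the `Appraisal` class, the 0-1 scales for skill and challenge, the `margin` threshold, and the action labels are all hypothetical, and a real ALT would have to estimate these quantities from its student model and interaction logs.

```python
from dataclasses import dataclass


@dataclass
class Appraisal:
    """Hypothetical estimate of a bored student's current situation."""
    skill: float      # estimated skill on an assumed 0-1 scale
    challenge: float  # estimated problem difficulty on the same assumed scale


def respond_to_boredom(appraisal: Appraisal, margin: float = 0.15) -> str:
    """Choose a difficulty adjustment based on the likely cause of boredom."""
    gap = appraisal.skill - appraisal.challenge
    if gap > margin:
        # Skill outweighs challenge (control too high): ramp up difficulty.
        return "increase_difficulty"
    if gap < -margin:
        # Challenge outweighs skill (control too low): ease off.
        return "decrease_difficulty"
    # Skill and challenge are roughly balanced, so difficulty is probably
    # not the cause; another explanation (e.g., low value) should be examined.
    return "investigate_other_causes"


print(respond_to_boredom(Appraisal(skill=0.9, challenge=0.3)))
print(respond_to_boredom(Appraisal(skill=0.2, challenge=0.8)))
```

The same detected state (boredom) leads to opposite actions in the two calls, which is the sense in which an intervention that ignores the underlying cause could be misaligned with it.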


Limitations and Resolutions 


There are four noteworthy limitations with this analysis. 
First, the relatively small number of studies analyzed is of some 
concern. The fact that only 24 studies were considered can be 
attributed to the small number of available studies with usable data 
and the informal nature of the search. Inconsistent measurement 
models, extremely low sample sizes, and the failure to publish 
error estimates on affect proportions were additional factors that 
contributed to the elimination of a number of studies. There was 
sufficient diversity with respect to the learning technologies, sub- 
ject domains, student characteristics, and affect judgment method- 
ologies, thereby alleviating some of the generalizability concerns.
Nevertheless, it would be advisable to repeat this analysis as
additional studies on affect and learning emerge in the literature. 

The second limitation was that there was an imbalance in the 
number of affective states considered across studies. Some affec- 
tive states were included in a large number of studies, and others 
were considered in only a handful of studies. For example, sad- 
ness, contempt, and fear were included in only three studies, and 
they were found to be relatively infrequent. On the other hand, 
surprise, which had the lowest effect size, was included in 14 studies.
These examples illustrate that it is not clear whether the relative
incidence of an affective state was affected by the number of
studies that included that state. Fortunately, the effect sizes for each state were
not significantly correlated with the number of studies with avail- 
able data (r = .179, p = .541). This suggests that imbalance in the 
number of affective states did not significantly impact the results, 
although more data are needed before one can be confident in the 
effect size distributions of states that were not sufficiently repre- 
sented in this set of studies. 
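
For readers who wish to check the significance test reported above, the following sketch converts a Pearson correlation into a two-tailed p value using the t distribution. It assumes the correlation was computed over n = 14 affective states (the number of states listed in the concluding remarks); that value of n is an inference, not something stated with the test itself. The function names are illustrative, and the p value is obtained by numerical integration rather than a statistics library.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(r: float, n: int, steps: int = 2000) -> float:
    """Two-tailed p value for a Pearson correlation r based on n pairs."""
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    # Simpson's rule (steps must be even) for the area between 0 and t.
    h = t / steps
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return 1 - 2 * area


# Assumed n = 14 states; r is taken from the text.
p = two_tailed_p(0.179, 14)
print(round(p, 2))  # 0.54
```

Under the n = 14 assumption the result is approximately .54, which is in line with the reported p = .541.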

The third limitation emerges from the fact that most of the 
affective states considered in the studies tended to be activating 
states (i.e., states with moderate to high arousal). These included 
positive activating states such as happiness, delight, and curiosity 
as well as negative activating states such as anger, confusion, and 
frustration. Sadness and boredom were the only two negative 
deactivating states (states with lower arousal), and positive
deactivating states, such as calmness and relaxation, were missing
entirely. This imbalance with respect to deactivating states makes it
impossible to make any claims about the relative incidence of 
these states. It is therefore advisable that future research studies 
should expand the set of affective states in order to strike more of 
a balance between activating and deactivating states. 

Previous research that has performed between-domain and 
within-domain comparisons on motivational orientations and aca- 
demic emotions has indicated important domain effects (Bong, 
2001; Goetz, Frenzel, Pekrun, Hall, & Lüdtke, 2007; Goetz,
Pekrun, Hall, & Haag, 2006). For example, Goetz et al. (2006) 
found that affective responses did not generalize from one domain 
to another and that the degree of domain-specificity varied across 
states. The present study did not include domain as a potential 
moderator, because the sample size of 24 studies was not suffi- 
ciently large to accommodate the considerable diversity in learning 
domains. This is the fourth limitation, and it should be addressed 
with a larger sample of studies. 


Concluding Remarks 


This paper investigated one of the most fundamental issues in 
the burgeoning research area on affect during learning with tech- 
nology. The purpose was to identify a set of discrete affective 
states that generalize across student populations, subject domains, 
learning contexts, learning technologies, and methods used to 
monitor affect. Learning was narrowly construed as a short but 
involved exchange between a student and some form of educa- 
tional technology, and the focus was on fine-grained assessments 
of natural expressions of discrete affect. 

The key take-home messages are that (a) affect distributions are 
consistent with a three-level hierarchy consisting of engagement/ 
flow that was relatively frequent in all the studies; boredom, 
confusion, curiosity, happiness, and frustration, whose relative
frequency varied across studies; and eight relatively infrequent
affective states (contempt, anger, disgust, sadness, anxiety, delight, 
fear, and surprise); (b) the relative frequency of the affective states 
is mainly related to the source of the affect judgment and the 
authenticity of the learning context; (c) it was not the basic 
emotions that have dominated the scientific landscape for over a 
century but, with the exception of happiness, a set of nonbasic 
affective states that was more relevant in the learning sessions that 
were analyzed; and (d) researchers might consider targeting bore- 
dom, confusion, and frustration as negative affective states to 
detect and address with appropriate strategies in affect-sensitive 
ALTs. 

In conclusion, this selective meta-analysis was intended to serve 
as an initial step toward organizing and synthesizing some of the 
emerging research on affect and learning. The hope is that it will 
serve as a launching point toward more basic research on the relative 
incidence, antecedents, dynamics, and consequences of the five non- 
basic affective states (plus happiness) that appear to be relatively 
more frequent during learning with technology. More advances in 
basic research should inspire and challenge the ALTs of the future 
to include mechanisms that increase motivation and sustain per- 
sistence by embracing rather than ignoring affect.


How can assessments measure complex science learning? Although traditional, multiple-choice items can
effectively measure declarative knowledge such as scientific facts or definitions, they are considered less 
well suited for providing evidence of science inquiry practices such as making observations or designing 
and conducting investigations. Thus, students who perform very proficiently in “science” as measured by 
static, conventional tests may have strong factual knowledge but little ability to apply this knowledge to 
conduct meaningful investigations. As technology has advanced, interactive, simulation-based assess- 
ments have the promise of capturing information about these more complex science practice skills. In the 
current study, we test whether interactive assessments may be more effective than traditional, static 
assessments at discriminating student proficiency across 3 types of science practices: (a) identifying 
principles (e.g., recognizing principles), (b) using principles (e.g., applying knowledge to make predic- 
tions and generate explanations), and (c) conducting inquiry (e.g., designing experiments). We explore 
3 modalities of assessment: static, most similar to traditional items, in which the system presents still
images and does not respond to student actions; active, in which the system presents dynamic portrayals,
such as animations, which students can observe and review; and interactive, in which the system depicts
dynamic phenomena and responds to student actions. We use 3 analyses—a generalizability study, 
confirmatory factor analysis, and multidimensional item response theory—to evaluate how well each 
assessment modality can distinguish performance on these 3 types of science practices. The comparison 
of performance on static, active, and interactive items found that interactive assessments might be more 
effective than static assessments at discriminating student proficiencies for conducting inquiry. 


Keywords: educational assessment, science education, multimedia, psychometrics, technology enhanced
assessment


Multiple forces are converging to propel science testing into the 
digital age. Recent national science education frameworks and 
standards advocate a significant shift in focus to fewer, more 
integrated core ideas, deeper understanding of dynamic science
systems, and greater use of science inquiry practices. Federal
accountability requirements call for science testing at the elemen- 
tary, middle, and secondary grades. International, national, and 
state tests are turning to technology to improve the efficiency of 
large-scale testing and extend the standards assessed. 

In view of these forces, science educators are concerned with the 
suitability of available assessments for measuring what students 
should know and be able to do in science. For example, the recent 
College Board Standards for Science Success, the National Re- 
search Council Framework for K-12 Science Education, and the 
draft Next Generation Science Standards recommend deeper learn- 
ing such as understanding the fundamental nature and behavior of 
science systems, along with the inquiry practices scientists use to 
study system dynamics (College Board, 2009; National Research 
Council [NRC], 2012). Yet, most existing large-scale science 
accountability tests do not address the full range of valued science 
standards, particularly understanding science systems and inquiry 
practices (Darling-Hammond, 2010; Darling-Hammond & 
Pecheone, 2010). As a result, there is concern about the construct 
validity of science accountability tests, that is, that the prevalent
static, multiple-choice item format that relies on recognition of
correct answers does not elicit evidence of integrated knowledge 
about science systems or abilities to conduct scientific inquiry (cf. 
Liu, Lee, & Linn, 2011; Quellmalz & Pellegrino, 2009; Smith, 
Wiser, Anderson, & Krajcik, 2006). 

Technology-based science tests are being developed to address 
the perceived limitations of traditional tests. These interactive 
assessments are rapidly appearing in state, national, and interna- 
tional testing programs. For example, the 2009 National Assess- 
ment of Educational Progress (NAEP) included interactive com- 
puter tasks (ICTs). The 2006, 2009, and 2012 cycles of the 
Programme for International Student Assessment (PISA) include 
computer-based forms (Koomen, 2006; National Assessment Governing
Board [NAGB], 2008). At the state level, Minnesota has an
online science test with simulated laboratory experiments and 
investigations of phenomena such as weather or the solar system 
(Minnesota Department of Education, 2010). Utah is piloting sci- 
ence simulations in their assessments (King, 2011). Moreover, the 
state testing consortia are designing technology-enhanced items to test 
English Language Arts and Math common core standards, so it is 
likely that tests of the forthcoming Next Generation Science Stan- 
dards will include innovative task and item formats. Further, results 
from the 2009 NAEP science ICTs showed that while most students 
performed well when making low-level observations from data, the 
majority of students performed poorly on complex assessment tasks 
involving multiple variables or strategic decision-making (National 
Center for Education Statistics [NCES], 2012). Though exam results 
reflect differences in student scores on interactive items, no studies 
have evaluated whether the interactive test modality is indeed more 
effective at distinguishing student proficiency on different science 
practice skills. 

In this article, we report an empirical study to test whether 
interactive assessments may be more effective than traditional, 
static assessments at discriminating student proficiency across 
three types of science practices. We employed the three key 
science practice skills from the NAEP 2009 Framework: (a) identifying
principles (e.g., stating or recognizing scientific principles),
(b) using principles (e.g., applying knowledge to make predictions 
and generate explanations), and (c) conducting inquiry (e.g., de- 
signing, running and interpreting experiments). We developed 
assessments in three modalities with increasing levels of interac- 
tivity. The static modality is the most similar to traditional assess- 
ments in which the system presents still images and text and is not 
responsive to student actions. The active modality presents dy- 
namic portrayals of phenomena, such as animations, which the 
student can observe and review. Finally, the interactive modality 
presents dynamic representations of the science phenomena and is 
responsive to student actions. 
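
The distinction among the three modalities rests on two properties of an item: whether the portrayal is dynamic and whether the system responds to student actions. A minimal sketch of that classification follows. The `Modality` enum and `classify_modality` function are illustrative names, not part of the assessment system, and the combination of a still display that nonetheless responds to actions is not described in the text, so the sketch treats it as undefined.

```python
from enum import Enum


class Modality(Enum):
    STATIC = "static"            # still images and text, no response to actions
    ACTIVE = "active"            # dynamic portrayals the student can observe and review
    INTERACTIVE = "interactive"  # dynamic portrayals that respond to student actions


def classify_modality(dynamic_display: bool, responds_to_actions: bool) -> Modality:
    """Map the two item properties onto the three modalities."""
    if dynamic_display and responds_to_actions:
        return Modality.INTERACTIVE
    if dynamic_display:
        return Modality.ACTIVE
    if responds_to_actions:
        # Assumption: this combination is not defined in the text.
        raise ValueError("responsive but non-dynamic items are not defined")
    return Modality.STATIC


print(classify_modality(dynamic_display=True, responds_to_actions=False).value)
```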


Why Use Simulations to Assess Science Practices? 


The powerful capabilities of technology may hold the key to 
transforming the range of science knowledge and practices that can 
be assessed (Quellmalz & Haertel, 2004; Quellmalz & Pellegrino, 
2009; Quellmalz et al., 2011). Scientists use physical, mathematical,
and conceptual models as tools for asking questions, testing
hypotheses, and communicating findings about natural and designed
systems (Clement, 1989; Nersessian, 2008). Simulations
and modeling tools dynamically represent the spatial, causal, and
temporal processes in science systems and permit active, virtual
investigations of phenomena that are too big or small, fast or slow, 
or dangerous to be conducted in hands-on labs (de Jong, 2006; 
Lehrer, Schauble, Strom, & Pligge, 2001; Quellmalz & Pellegrino, 
2009; Stewart, Carter, & Passmore, 2005). As simulations and 
models become “tools of the trade” in science, they also become 
important mechanisms for allowing students to demonstrate a 
range of science practices including making observations and 
designing and carrying out investigations. 

Numerous studies that have illustrated the benefits of science 
simulations for student learning have also demonstrated the potential 
of simulations for assessment. Simulations can support the develop- 
ment of deeper understanding and better problem-solving skills in 
areas such as genetics, environmental science, and physics (cf. Adams 
et al., 2008; Buckley, Gobert, Horwitz, & O’Dwyer, 2010; Horwitz,
Gobert, Buckley, & O’Dwyer, 2010; Krajcik, Marx, Blumenfeld, 
Soloway, & Fishman, 2000; Schwartz & Heiser, 2006; Zacharia, 
2007). For instance, students using an aquatic ecosystem simulation 
or a collective simulation of multiple human body systems were able 
to demonstrate causal connections among the levels of these systems 
(Hmelo-Silver et al., 2008; Ioannidou et al., 2010; Slotta & Chi, 2006; 
Vattam et al., 2011). Using a computer-based simulation tool that
allowed students to create, test, and revise models helped students
develop more robust and transferable modeling skills than
worksheet-based instruction (Papaevripidou, Constantinou, & Zacha- 
ria, 2007). Interactive assessment tasks that take advantage of the 
affordances of simulations have the potential to capture evidence of 
progress on the use of complex science practices and to transform 
how science is tested. 


Designing Effective Assessments 


Though simulations can provide compelling environments for 
evaluating a range of science practices, assessments are only as 
effective as their design. Research provides guidance regarding the 
identification of a scientifically appropriate context, the alignment 
between assessment tasks and learning objectives, and the mini- 
mization of extraneous cognitive processing. Below we summarize 
research related to these facets of effective assessment design and 
present a set of design principles that were used to guide the 
development of the assessments for the current study. 


Selecting a Scientifically Appropriate Context 


Taking Science to School and Applying Cognitive Science to 
Education recommend that rather than teaching and testing indi- 
vidual ideas and skills separately, knowledge and skills be taught 
and tested in the context of a larger investigation linked to a 
driving question. Currently, PISA, Trends in International Mathematics
and Science Study (TIMSS), NAEP, and state tests administer
problem-based sets of inquiry tasks set in authentic contexts.
The reports recommend that assessments probe integrated
knowledge structures (schema), contextualize items in meaningful 
tasks, and address not just declarative and procedural knowledge, 
but also measure schematic knowledge and strategic reasoning in 
problem solving and inquiry tasks. The Framework for K-12 
Science Education and draft Next Generation Science Standards 
echo these research-based design principles. 

The recommendations for assessments resonate with cognitive 
science research on expertise. Across academic and practical domains,
research on the development of expertise indicates that
experts have acquired large, organized, interconnected knowledge 
structures, called schema, and well-honed, domain-specific 
problem-solving strategies (Bransford, Brown, & Cocking, 2000). 
Jacobson characterized the schema of experts as “complex systems”
mental models in contrast to the deterministic “clock-work”
mental models of novices (Jacobson, 2001). Learning theory holds 
that the learning environments in which students acquire and 
demonstrate knowledge should represent contexts of use (Collins, 
Brown, & Newman, 1989; Simon, 1980). Critics of current testing 
practices cite the overemphasis on disconnected, decontextualized 
declarative and procedural knowledge in contrast to integrated 
schematic knowledge and strategic problem solving and inquiry 
(Linn & Eylon, 2011; Quellmalz & Pellegrino, 2009). 


Aligning Assessment Tasks and Learning Objectives 


If students have a deep understanding of a science system, they 
should both understand core concepts and be able to use their 
knowledge to make inferences and conduct scientific investiga- 
tions. Thus, the challenge of science assessment is to develop tasks 
that do not simply tap into disconnected bits of declarative and 
procedural knowledge but that call for the schematic and strategic 
knowledge needed to reason about complex systems and engage in 
scientific inquiry practices. 

The NRC report, Knowing What Students Know, integrated the 
learning research summarized in How People Learn with advances in 
measurement science to describe systematic test design frameworks. 
The evidence-centered assessment design framework provides a strategy
for ensuring that assessment tasks are aligned with the learning
objectives (Messick, 1994; Mislevy, Almond, & Lukas, 2003; Pel- 
legrino, Chudowsky, & Glaser, 2001). The framework suggests a 
process that begins with a clear specification of learning objectives in 
what it refers to as a student model. For example, a part of the student 
model may be that “students will be able to design a controlled 
experiment to test a hypothesis.” Next, the framework specifies a task 
model that describes the features of the task and environment that will 
allow students to demonstrate they have mastered the skills specified 
in the student model. For instance, a simulation might allow students 
to set a number of variables and observe the results of trials run with 
the settings they selected. Finally, the evidence model specifies what 
student responses and scores would serve as evidence of proficiency. 


Table 1 
2009 NAEP Science Practices 


QUELLMALZ ET AL. 


For example, the rubric for the controlled experiment design task 
would assume student mastery if the student varied the single variable 
of interest and controlled the remaining variables across trials. 
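
The evidence rule in this example can be sketched in code. The sketch below is an illustration under stated assumptions, not the scoring procedure used in the study: each trial is represented as a hypothetical mapping from variable names to the settings the student chose, and the function name and the variable names in the usage example are invented for illustration.

```python
from typing import Dict, List

Trial = Dict[str, float]  # variable name -> setting chosen by the student


def is_controlled_experiment(trials: List[Trial], target: str) -> bool:
    """Return True if only the target variable changes across the trials."""
    if len(trials) < 2:
        return False  # a comparison needs at least two trials
    first = trials[0]
    if target not in first:
        return False  # the variable of interest was never set
    if any(set(trial) != set(first) for trial in trials):
        return False  # every trial must set the same variables
    # Evidence of mastery: the variable of interest takes more than one
    # value, and every other variable is held at a single value.
    target_varied = len({trial[target] for trial in trials}) > 1
    others_fixed = all(
        len({trial[name] for trial in trials}) == 1
        for name in first
        if name != target
    )
    return target_varied and others_fixed


trials = [
    {"light": 1, "water": 5, "temperature": 20},
    {"light": 3, "water": 5, "temperature": 20},
]
print(is_controlled_experiment(trials, "light"))  # True
```

A student who changed both light and water between trials would not satisfy the rule, because the effect of the variable of interest could no longer be isolated.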

Cognitively principled assessment design for science begins with a 
student model derived from a theoretical framework of the kinds of 
knowledge structures and strategies students should demonstrate as 
evidence of their level of expertise. The model-based learning and 
national science education frameworks and standards identify the 
broad conceptual knowledge structures and inquiry practices deemed 
by the profession to be goals of science education (Achieve, 2012; 
American Association for the Advancement of Science [AAAS], 
1993; College Board, 2009; NAGB, 2009; NRC, 2011). The science
practices set forth for the 2009 Science NAEP guided the science 
inquiry targets specified in the student model for our study. Table 1 
summarizes the science practices and their cognitive demands as 
specified in the 2009 NAEP Science Framework. 

For the task model, we extracted tasks from learning research on 
inquiry. This work shows that students in kindergarten through 
eighth grade, with appropriate scaffolding, can engage in investi- 
gations, make hypotheses, gather evidence, design investigations, 
evaluate hypotheses in light of evidence, and build their conceptual 
understanding (Geier et al., 2008; Lehrer & Schauble, 2002; Metz, 
2004). A number of studies provided evidence that such project- 
based experiences helped students learn scientific practices. 
Kolodner et al. (2003) found that middle school students who 
practiced inquiry in several project-based science units performed 
better on the inquiry tasks of scientific practice (as measured by 
performance assessments, Quellmalz, Schank, Hinojosa, & Pa- 
dilla, 1999) than students from traditional classrooms. Moreover, 
all students, particularly English language learners, benefited 
greatly from inquiry-based science instruction that depended less 
on mastery of English than does decontextualized textbook knowl- 
edge or direct instruction by the teacher (O. Lee, 2002). Engaging 
students in active investigations allows students to demonstrate 
science inquiry skills and has been shown to increase conceptual 
understanding (cf. Dede, 2009; Geier et al., 2008; Kolodner et al., 
2003; Lehrer & Schauble, 2002; Marx et al., 2004; Metz, 2004; 
Rivet & Krajcik, 2004). 

Specifications of the evidence models were based on identifying the
types of student responses within the simulation-based tasks that
would serve as evidence of proficiency on science knowledge related
to three system model levels (components, interactions, and emergent
system behavior) and the inquiry practices specified in the NAEP
2009 Framework. Rules were generated for scoring responses and
summarizing them in order to report the proficiency levels.


Science practice         Cognitive demand                          Examples of skills related to practice

Identifying principles   Declarative knowledge,                    • Knowing facts
                         “Knowing that”                            • Stating and recognizing science principles

Using principles         Schematic knowledge,                      • Using patterns in observations
                         “Knowing why”                             • Making predictions
                                                                   • Creating explanations

Conducting inquiry       Procedural and strategic knowledge,       • Designing experiments
                         “Knowing how and when”                    • Testing predictions
                                                                   • Generating conclusions
                                                                   • Evaluating explanations

Note. NAEP = National Assessment of Educational Progress.


Minimizing Extraneous Processing 


Though visualizations and simulations have many affordances 
for learning, the additional information they present may also 
distract or overwhelm students. Multimedia learning researchers 
have examined the effects of pictorial and verbal stimuli in static, 
animated, and dynamic formats, as well as the effects of active 
versus passive learning enabled by degrees of learner control 
(Clark & Mayer, 2011; Lowe & Schnotz, 2007; Mayer, 2005b).
Mayer’s (2005a) Cambridge Handbook of Multimedia Learning
and Clark and Mayer’s recently updated book, eLearning and the
Science of Instruction, summarize multimedia research and offer
principles for multimedia design (Clark & Mayer, 2011).

The majority of multimedia design principles address how to 
focus students’ attention and minimize extraneous cognitive pro- 
cessing. Research suggests guiding attention by making the most 
important information salient and omitting irrelevant representa- 
tions (cf. Betrancourt, 2005; Clark & Mayer, 2011). The use of 
visual cues such as text consistency, color, and arrows can help 
students map between representations and gain a deeper concep- 
tual understanding (cf. Ainsworth, 2008; Kriz & Hegarty, 2007; 
Larkin & Simon, 1987; Lowe & Schnotz, 2008; Pedone, Hummel, 
& Holyoak, 2001). 

Research specific to animations and interactive environments 
finds that the temporal nature of dynamic displays places increas- 
ing attention and memory demands on learners. To mitigate these 
additional demands, the research recommends (a) task-format 
alignment, (b) allowing for user control, (c) signaling upcoming 
changes, and (d) ensuring that the fidelity of the display is appro- 
priate for the task. Task-format alignment suggests that dynamic or 
interactive features should be included only when they are required 
for the task. Animations are considered particularly useful for 
providing visualizations of dynamic phenomena that are not easily 
observable in real space and time scales (e.g., plate tectonics,
circulatory system, animal movement; Betrancourt, 2005; Kuhl,
Scheiter, Gerjets, & Edelmann, 2011). User control allows students
to pause, rewind, and replay dynamic presentations. Controlling
the pace of the presentation can increase the likelihood that
students will learn from and understand the display (cf. Lowe & 
Schnotz, 2008; Schwartz & Heiser, 2006). Signaling complex 
animations by giving cues such as “there will be three steps” and 
directly instructing students to reason through the components of 
systems increases student comprehension (Hegarty, 2004;
Schwartz & Black, 1999; Tversky, Heiser, Lozano, MacKenzie, & 
Morrison, 2008). Mayer and Johnson (2008) found that redun- 
dancy of text in multimedia presentations may be beneficial when 
on-screen text is short, highlights the key action described in the 
narration, and is placed next to the portion of the graphic that it 
describes in order to highlight salient features of a multimedia 
presentation. Finally, the fidelity principle suggests that the com- 
plexity of a simulation should be appropriate for the learner 
outcomes. Rather than realistically portraying every detail of a 
system, it is more important to ensure that the most relevant parts 
of the system are easily discernible (cf. H. Lee, Plass, & Homer, 
2006; van Merriénboer & Kester, 2005). 

Animations become interactive simulations if learners can 
manipulate parameters as they generate hypotheses, test them, 
and see the outcomes, thereby taking advantage of technological
capabilities well suited to conducting scientific inquiry. For
example, Rieber, Tzeng, and Tribble (2004) found that students 
given graphical feedback during a simulation on laws of motion 
with short explanations far outperformed those given only tex- 
tual information. Plass, Homer, and Hayward (2009) found that 
interactivity that allows for the manipulation of the content of 
a visualization, not just the timing and pacing, could improve 
learning outcomes compared to static materials. The authors 
suggest this is due to increased cognitive engagement (Plass et 
al., 2009). 

From the previously cited bodies of literature, we distilled design
principles for next-generation science assessments. The
assessments for this study were designed according to recommenda- 
tions for quality science assessments being made by science educa- 
tors, cognitive scientists, and learning theorists. Table 2 summarizes 
the design principles that were used to ensure the use of scientifically 
appropriate contexts, the alignment between tasks and learning ob- 
jectives, and the minimization of extraneous cognitive processing. 





Table 2

Goal                               Design principles
Identify scientifically            • Create rich environments that allow students to apply rich,
appropriate context                  interconnected knowledge
                                   • Use authentic contexts to motivate assessment
Ensure alignment between tasks     • Use evidence-centered design to ensure that tasks elicit
and learning objectives              evidence of proficiency for clearly specified learning goals
                                   • Use task structures that tap into strategic and schematic
                                     knowledge
Minimize extraneous cognitive      • Align the fidelity of the simulation with the task
processing                         • Eliminate interesting but task-irrelevant pictures and text
                                   • Use visual cues to guide attention
                                   • Ensure the task is appropriate for the multimedia in an item
                                   • Allow users to control pace and replay of dynamic information
                                   • Signal upcoming changes in animations



Method


To examine how well each modality of assessment distinguishes 
between the science practice constructs, we used three different 
analyses—a generalizability study, a Multitrait-Multimethod Con- 
firmatory Factor Analysis, and a multidimensional Item Response 
Theory (IRT) model. Specifically, we investigated the following 
question: Do student responses on assessments in different modal- 
ities (static, active, and interactive) provide different information 
about students’ proficiencies on the three science practices: knowl- 
edge of science principles, use of science principles, and ability to 
conduct scientific inquiry? 


Study Design 


To answer our research question, we developed three parallel 
assessments in the context of the life science topic of ecosystems. 
Items in the three modalities were designed to test the same 
science practice constructs and to be comparable on all key stim- 
ulus and response features—except for dynamic representations of 
science phenomena and levels of interactivity, which varied across 
the static, active, and interactive modalities. We then analyzed our 
data using multiple psychometric and statistical techniques to 
determine whether the dynamic, active, and interactive assessments
were better able to independently estimate student performance
across the three science practices (Identifying Principles,
Using Principles, and Conducting Inquiry). 

Participants. A total of 1,836 students (910 female, 899 male, 
and 27 of unrecorded gender) from the classrooms of 22 middle 
school science teachers in 12 states participated in the study as part 
of normal classroom activities. Teachers received a stipend for the 
time needed to complete study activities (e.g., providing demo- 
graphic information and enrolling students in the online learning 
management system). Due to absences, only 1,566 students (778 
female, 776 male, 12 unknown) completed all three versions of the 
assessment. Thus, the total sample size included in the analyses 
was 1,566. 


Materials. We applied the design principles described above
to create three assessments in the context of ecosystems, each in 
the modality that reflected a differing level of interactivity: static, 
active, or interactive. The static assessment was designed to be 
similar to traditional, multiple-choice assessments (Figures 1 and
4). No part of the assessment was dynamic, that is, images and text 
were still, and the system was not responsive to student inputs. 
Visual representations for the static items were carefully chosen to 
be task relevant and minimize extraneous processing. The active 
assessment included items with dynamic displays such as anima- 
tions. For instance, students could observe organisms interact and 
watch dynamic displays of experimental trials being run in a 
simulation environment. Figures 2 and 5 demonstrate active items 
that allow students to view dynamic ecosystems. Consistent with 
the principle to allow users to control the pace of dynamic infor- 
mation, the animations could be reviewed and paused as students 
observed predator-prey interactions. Finally, the interactive assess- 
ments shown in Figures 3 and 6 went beyond dynamic displays to 
permit student input that would result in a new screen reacting to 
the student input. Similar to the interactive items created for the 
2009 NAEP Science Interactive Computer Tasks (prototypes of 
which were developed at WestEd), the interactive items created for 
the current study enabled learners to design experiments by ma- 
nipulating parameters in a simulation, collect data by running 
simulations, and draw conclusions based on seeing the outcomes 
of their tests. The use of interactive feedback and simulations takes 
advantage of technological capabilities well suited to the science 
practice of conducting inquiry. Figure 6 shows an example of the
interactive feedback, which takes the form of highlighting related
information as students hover over organisms or the legend: as
students mouse over an organism, the name of that organism is
highlighted in the legend. This task feature reflects the design principle of using
visual cues (color) to guide attention. Importantly, simulations 
allow students to design experiments and demonstrate their ability 
to change one variable at a time by setting sliders and to gather 
data by running the simulation and observing graphical and tabular
output. Specifically, Figure 6 shows an item that allows students to
set sliders to indicate the initial number of primary and secondary
consumers and observe the effects of these settings on resulting
population levels. To minimize extraneous processing, we used
color as a visual cue to help students map between the values they
selected and the outputs of the simulation.

In the tundra, hares and caribou eat grass. Hares also eat
lichens. Bears eat hares. Grass and lichens do not eat any other
organisms. They make their own food using carbon dioxide in
the air and water.

Using the relationships between plants and animals described
on the left, select the correct food web diagram. Arrows point
FROM the food source TO the eater.

Figure 1. Ecosystem food web task: static modality.

Figure 2. Ecosystem food web task: active modality.
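
The skill these simulation items target, changing one variable at a time across trials, can be made concrete with a short check. The following is a minimal sketch and not the scoring rule used in the study: the function name `isolates_variable`, the dictionary layout of a trial, and the slider values are all hypothetical.

```python
def isolates_variable(trials, target):
    """Return True if only `target` differs across the saved trials.

    `trials` is a list of dicts mapping an input name to its starting value.
    This layout is an assumption made for illustration.
    """
    if len(trials) < 2:
        return False  # a single trial cannot show the effect of any input
    names = trials[0].keys()
    # Collect every input that takes more than one value across the trials.
    varied = {name for name in names if len({t[name] for t in trials}) > 1}
    return varied == {target}


# Hypothetical slider settings for three saved trials (values are made up).
trials = [
    {"alewife": 40, "shrimp": 120},
    {"alewife": 80, "shrimp": 120},
    {"alewife": 120, "shrimp": 120},
]
print(isolates_variable(trials, "alewife"))
```

A design that also changed the shrimp value between trials would fail the check, which mirrors the kind of confounded design students are asked to evaluate in the static and active items.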

The three modalities were designed to preserve the deep struc- 
ture of each representation of the ecosystem model level (e.g., the 
same components, the same interactions among components, the 
same emergent population levels). The surface features, for exam- 
ple, the specific organisms, of each ecosystem varied. Each mo- 
dality was presented in one of the three different ecosystem con- 
texts (tundra, grasslands, and mountain lake). In the static 
modality, students viewed still images on the screen within a 


Make a food web diagram for the mountain lake. Draw arrows 
to show the transfer of matter between organisms. 


Be sure to include each organism in the food web. 





tundra ecosystem. In the active modality, students viewed anima- 
tions of a grasslands ecosystem but did not manipulate features or 
conduct active investigations. In the interactive modality, students 
identified and used ecosystem principles within a mountain lake 
ecosystem and conducted inquiry in tasks such as designing and 
running their own experiments on population levels. Each modal- 
ity assessed the same three science practice constructs (Identifying 
Principles, Using Principles, and Conducting Inquiry). 

For each of the three assessment modalities, evidence-centered 
design was used to create items that elicited evidence of profi- 
ciency on the three science inquiry constructs. Six items were 
designed to test the construct of /dentifying Principles, six items 
were designed to test Using Principles, and 12 items were de- 
signed to test Conducting Inquiry for a total of 72 items in the 


To draw an arrow, click and drag from one dot to another dot. 
To delete an arrow, double click on it. Please draw arrows 
FROM the food source TO the eater. 


You can review the animation and then return to this diagram. 


Figure 3. Ecosystem food web task: interactive modality. 
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context of ecosystems. A matrix was created with rows for each of 
the 72 items with specific statements related to the learning target 
in the context of ecosystems. For example, more specifically than 
assessing “Identifying principles,” the statement for a static item 
was “student is able to use descriptions and pictures of feeding 
relationships and correctly identify Yes/No whether each organism 
is a consumer.” The accompanying active item statement could
be, “student is able to observe predator-prey interactions and
identify Yes/No whether each organism is a consumer.”

To ensure our items reflected the design principle of scientifi- 
cally appropriate content, we based our items on the interactive 
modality for ecosystems that had been developed and validated in 
prior research. The alignment of the interactive tasks with science 
practices for conducting inquiry had been established in prior 
projects. Similarly, the alignments of tasks for measuring, identi- 
fying, and using principles related to ecosystems had been estab- 
lished in prior research. Therefore, the representations of the 
components, interactions, and emergent population levels of eco- 
systems validated in our prior research formed the basis for mod- 
ifying the ecosystem representations and assessment tasks for the 
three modalities (Quellmalz, Timms, & Buckley, 2010; Quellmalz 
et al., 2011). The existing ecosystem simulation environments and 
templates for ecosystems model levels and inquiry tasks were used 
to generate parallel sets of static, active, and interactive items. 

Figures 1-3 illustrate the science constructs of Identifying Principles
and Using Principles in a food web task in the different
modalities. In the static modality, shown in Figure 1, students read 
text descriptions about the interactions of organisms (components) 
in an ecosystem, and students were asked to select a static image 
of the correct food web. The ability to correctly map the written 
description to a graphical food web is a component of the practice 
Identifying Principles. In the active modality, Figure 2, students 
observed an animation of organisms in an ecosystem and used 
their knowledge of the principles of organism roles (consumers, 
producers) to identify the roles of the organisms and then draw a 
food web diagram depicting the flow of energy and matter through
an ecosystem. Students could replay the animation. In the interactive
modality, students observed the animation of an ecosystem
and could take advantage of additional interactivity that used
highlighting on demand to cue the connection between the names
and pictures of the organisms. The skill of using observations to
generate explanations (in the form of creating a food web diagram)
was aligned with the practice Using Principles.

The potential confounding effects of nesting the assessments in
a modality within a specific ecosystem (tundra, grasslands) were
minimized both by maintaining the same structure of tasks and
items for each modality expressed in each ecosystem and by
focusing on assessment of science practices, not on knowledge of
features of specific organisms or interactions that would differ
between ecosystems.

Figures 4-6 illustrate items designed to test the science practice
construct Conducting Inquiry in the three different modalities. In
the static modality, students viewed the outcomes of an investigation,
and students were asked to select an appropriate evaluation of
the design. In the active modality, the student evaluated the design
of an investigation after watching an animation of data being
generated based on the design. In this active modality, students did
not select inputs for the simulation; they only watched the simulation
run. Finally, in the interactive modality, students designed
their own investigations by setting the inputs for the simulation,
running the simulation, observing the data tables and graphs being
populated, and saving their own trials.

Starting Values; Grass, Hare, and Bear Populations

A student wanted to know whether the starting number of bears
affects the number of hares at Year 19.

Will the three trials he designed allow him to answer his
question?

• Yes, because he changed the number of bears across
  the three trials.
• Yes, because he chose the same starting number of
  bears and hares in each trial.
• No, because he changed the number of hares across
  the three trials.
• No, because he changed the number of bears across
  the three trials.

Figure 4. Investigation design task: static modality.

Starting Values; Grass, Cricket, and Lizard Populations

Now the student wants to test if the starting number of lizards
affects how the number of crickets changes in the first year.

Click RUN to see the results of each trial.

Which trials provide information to allow the student to answer
his question?

• Both Trials 1 and 2
• Both Trials 2 and 3
• All Trials (1, 2, and 3)
• The student cannot answer his question with the
  trials he ran.

Figure 5. Investigation design task: active modality.

Starting Values; Algae, Shrimp, and Alewife Populations

Does the starting number of alewife affect the shrimp
population at Year 1?

Design and save 3 trials to test if the starting number of
alewife affects the shrimp population at Year 1.

• Set the values on the sliders.
• Click RUN.
• Save the results or click CHANGE VALUES to design a
  different trial.
• You can use the Data Inspector to explore the graph.

Figure 6. Investigation design task: interactive modality.

Design and procedures. All items were administered online,
and data were collected using the SimScientists Learning Management
System (Quellmalz, Timms, Silberglitt, & Buckley, 2012).
The learning management system allows teachers to check on
individual students and researchers to download de-identified student
data. Initial construct validity of the items was examined by
expert reviews and cognitive labs to confirm that the tasks and
items were eliciting the intended practices.

Expert reviews. At two points in the process of developing the
parallel item sets, three experts from the American Association for the
Advancement of Science (AAAS) independently reviewed the items
and judged if each item was aligned with one of the targeted science
practices of Identifying Principles, Using Principles, or Conducting
Inquiry. These experts also verified that the items were scientifically
accurate, grade-level appropriate, usable, and comparable across the
static, active, and interactive versions. Initially, AAAS staff reviewed
the storyboards of draft items and provided detailed comments
and feedback. An additional iteration of review and revision was
carried out with the programmed items to ensure the final items
remained aligned with targeted science inquiry practices.

Cognitive labs. Ten students participated in think-aloud studies
to determine if the items elicited the targeted science inquiry practices.
Each student completed all three forms of the assessments, one in
each modality (static, active, interactive). As students completed the
assessments, they “thought aloud” by saying everything they were
thinking while screen capture software recorded students’ verbalizations
and actions on the screen, and researchers coded whether the
items elicited the targeted construct. The think-aloud studies had two
goals: (a) to ensure the usability of the assessments as deployed and
(b) to provide evidence of construct validity by determining that the
questions were eliciting student thinking and reasoning about the
intended science practice constructs. To ensure the items would be
usable in the field test, researchers took detailed notes of usability
issues that arose (e.g., navigation, difficulty running experimental
trials) and modified the items to address these issues. To examine the
items’ construct validity, the observing researcher coded whether the
item prompted student thinking related to the targeted science practice
constructs. Table 3 summarizes the percentage of items in the assessments
judged to elicit their intended construct targets. These data
provided one form of evidence that the items were aligned with their
intended content and inquiry targets.

Table 3
Percentage of Items Judged to Elicit Their Intended Construct
Targets

Modality        Identifying principles   Using principles   Conducting inquiry
Static %        100                      97                 98
Active %        98                       98                 100
Interactive %   98                       98                 98

Procedures. The study used a within-subjects design, as all 
participating students took all three modalities of the ecosystems 
assessments, that is, one period of static items, one period of active 
items, and one period of interactive items. The assessments were 
given in three consecutive sessions, and the order of the sessions 
was fully counterbalanced at the class level across the six possible
sequences of static (S), active (A), and interactive (I) assessments
(SAI, SIA, ISA, IAS, AIS, and ASI). Table 4 shows the number of items in
each test session. 
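
Full counterbalancing over three modalities can be illustrated by enumerating the six orderings and spreading classes evenly across them. This is only a sketch: the source does not say how classes were assigned to sequences, so the round-robin rule, the function name `assign_orders`, and the class identifiers are assumptions.

```python
from itertools import cycle, permutations

# The six possible session orders for static (S), active (A), interactive (I).
ORDERS = ["".join(p) for p in permutations("SAI")]


def assign_orders(class_ids):
    """Assign each class one of the six orders in round-robin fashion.

    Round-robin assignment is a hypothetical choice; it only guarantees that
    the orders are used equally often when the class count is a multiple of 6.
    """
    return dict(zip(class_ids, cycle(ORDERS)))


if __name__ == "__main__":
    assignment = assign_orders([f"class_{k}" for k in range(1, 13)])
    for class_id, order in assignment.items():
        print(class_id, order)
```

With 12 hypothetical classes, each of the six orders is used exactly twice, so order effects are balanced across modalities at the class level.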

Participating teachers received a detailed “Teacher Guide” that 
outlined step-by-step the processes and procedures for study activities, 
and teachers were able to view online movies that demonstrated how 
to carry out the online processes and procedures necessary for participation.
During the study, students either used laptops in their science
classrooms or went to the school computer lab. 


Analyses 


In order to answer the research question, we used three different 
analyses to compare how well each assessment modality measured 
the three science practice constructs: (a) a generalizability study 
(G-study), (b) a Multitrait-Multimethod Confirmatory Factor Analysis
(MTMM/CFA), and (c) a multidimensional Item Response Theory
(MIRT) model. These three statistical methods employ models with 
successively stronger statistical assumptions. A G-study treats items 
as randomly sampled from a universe of potential items. It makes 
virtually no statistical assumptions beyond minimal assumptions that 
certain error components are uncorrelated. The MTMM/CFA imposes 
a specific theoretical model to test predictions about patterns of 
intercorrelations among the nine constructs (three science practice 
constructs by three assessment modalities). It is at a higher level of 
aggregation, moving from the item level to the level of nine composite 
scores that were formed according to the a priori framework underlying
the instrument. The MIRT model returns to the item level but
imposes much stronger assumptions about the functional form of item
characteristics as well as assumptions about the grouping of items into
scales. Each modeling approach is described in more detail
below.

Generalizability study. Generalizability analysis was chosen 
as the first analytic method. Generalizability studies (G-studies) 
extend the reliability analyses of classical test theory in powerful 
ways, making possible the quantitative investigation of multiple 
sources of error in a data set. Of particular relevance for the present 
study, multivariate G-studies enable simultaneous investigation of 
measures tapping multiple constructs, accounting for sources of 
correlated error across constructs. In the studies conducted here, 
G-studies indicate the magnitude of error variance components 
attributable to items as well as the interaction of persons by items 
plus residual variance. 

The multivariate G-study analysis was conducted using the 
mGENOVA computer program (Brennan, 2001). Given that there
were three ecosystem contexts for tasks (Tundra, Grasslands, and
Mountain Lake) corresponding to three item modalities (static,
active, and interactive) and that each task consisted of items
addressing three different science practice constructs (identifying,
using, conducting), the analysis was a multivariate G-study treating
the nine modality-by-construct combinations as nine separate
constructs. In the mGENOVA terminology, this is a (p• × i°)
design. The p facet represents persons (students). The solid circle
means that the same students responded across all nine constructs.
The i represents items. The open circle means that the items were
nested within each of the nine constructs. This p• × i° notation
represents the univariate design for each of the nine separate
constructs. These constructs were specified and treated as a multivariate
fixed facet in mGENOVA.
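
Within any one of the nine constructs, persons are crossed with items, and the variance components can be estimated from the usual two-way mean squares. The sketch below shows that univariate step only, under the assumption of complete data. It is not mGENOVA's implementation, it omits the covariance components that the multivariate analysis also estimates, and the function name, the truncation of negative estimates at zero, and the toy scores are illustrative choices.

```python
def g_study(scores):
    """Estimate variance components for a crossed persons x items design.

    `scores` is a list of rows (one per person), each a list of item scores.
    Returns person, item, and residual components plus the generalizability
    coefficient for relative decisions based on the mean over n_i items.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i  # person-by-item interaction plus error

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    var_res = ms_res
    # Negative estimates are set to zero here (a common convention).
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)
    g_coef = var_p / (var_p + var_res / n_i) if var_p > 0 else 0.0
    return {"person": var_p, "item": var_i, "residual": var_res,
            "g_coefficient": g_coef}


# Toy data: 4 hypothetical students by 3 dichotomous items.
toy = [[1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
print(g_study(toy))
```

The residual term here corresponds to the "interaction of persons by items plus residual variance" described above, which cannot be separated with one observation per person-item cell.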

Multitrait-multimethod confirmatory factor analysis. The 
second type of analysis selected to answer the research question 
was a multitrait-multimethod CFA (Campbell & Fiske,
1959; Loehlin, 1998), which attempts to separate out the true 
variance on measured traits (the underlying constructs being mea- 
sured) from the variance that is due to the method of measurement 
(the modality). It is well-suited to this study because the same 
traits/constructs (the three science practices) were measured with 
three different methods (the static, active, and interactive modal- 
ities). The resulting correlations among the different measurements 
were then arranged in a multitrait-multimethod matrix in order to 
assess the convergent validity, the tendency for different measure- 
ment methods to converge on the same trait/construct, and the 
discriminant validity, the ability to discriminate among different 
traits/constructs. 
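
The logic of reading a multitrait-multimethod matrix can be sketched with plain correlations. The example below is a descriptive illustration in the spirit of Campbell and Fiske (1959), not the confirmatory factor model fitted in this study; the function names, the two-trait by two-method toy data, and the choice to summarize with simple averages are all assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists (nonzero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mtmm_summary(scores):
    """Average the two key blocks of an MTMM correlation matrix.

    `scores` maps (trait, method) to a list of composite scores, one per
    student. Same trait measured by different methods indicates convergent
    validity; different traits sharing a method indicates method variance.
    """
    convergent, shared_method = [], []
    for k1, k2 in combinations(sorted(scores), 2):
        r = pearson(scores[k1], scores[k2])
        same_trait, same_method = k1[0] == k2[0], k1[1] == k2[1]
        if same_trait and not same_method:
            convergent.append(r)
        elif same_method and not same_trait:
            shared_method.append(r)
    return {
        "monotrait_heteromethod": sum(convergent) / len(convergent),
        "heterotrait_monomethod": sum(shared_method) / len(shared_method),
    }


# Made-up composites for four students, two traits, two methods.
toy = {
    ("identifying", "static"): [1, 2, 3, 4],
    ("identifying", "interactive"): [2, 4, 6, 8],
    ("inquiry", "static"): [4, 1, 3, 2],
    ("inquiry", "interactive"): [8, 2, 6, 4],
}
print(mtmm_summary(toy))
```

High same-trait, different-method correlations together with lower different-trait, same-method correlations are the pattern that supports both convergent and discriminant validity.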


Table 4
Number of Items in Each Test Session by Modality Format and Science Practice Construct

                                 Test Session A    Test Session B    Test Session C
Science practice constructs      static modality   active modality   interactive modality   Total items
Identifying science principles   6                 6                 6                      18
Using science principles         6                 6                 6                      18
Conducting science inquiry       12                12                12                     36
Total items                      24                24                24                     72


This analysis was used to test the following two hypotheses that
stem from the research question: 


1. The factor loadings for method (assessment modality) are 
higher than for trait (science practice constructs). 


2. The factor loadings from assessment modality to the
three science practice constructs are lower for the interactive
modality than for the static and active modalities.


The first hypothesis reflects our belief that the assessment 
modality is important in the assessment of science practices. If the 
factor loadings on the methods (assessment modality) are higher 
than those for the trait (science practice constructs), it tells us that 
the student scores are determined more by the assessment modality 
than by the science practice constructs being assessed (this is what 
we expect). If, on the other hand, the factor loadings are higher on 
the trait, it tells us that the science practice constructs are more 
important in determining student scores and the assessment mo- 
dality is less important. 

The second hypothesis examines which assessment modality mea- 
sures more distinct science practice constructs. Higher correlations 
among science practice constructs for an assessment modality result 
in stronger factor loadings. This would indicate that the traits are not 
being measured distinctly under a certain assessment modality. We 
expect that the interactive modality measures the science practice 
constructs more distinctly than the other two modalities. 

The multitrait-multimethod analysis used the same data set of 
1,566 complete responses as was used for the G-study analyses. 
However, instead of using the responses to 72 individual items, 
nine composite scores were computed and used in the CFA. This 
was to simplify the analysis and the interpretation and to reduce 
the possibility of the estimation not being able to converge during 
analysis. The nine composite scores were simply the sums of the six
Identifying Principles items, the six Using Principles items, and
the 12 Conducting Inquiry items,
computed for each of the three modalities (static, active, and
interactive). The analysis was conducted following the procedures
discussed on pp. 101-105 in Loehlin (1998). The Mplus¹ computer
program was used for this analysis.
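
The reduction from 72 item responses to nine composite scores can be sketched as follows. The item counts come from the text; everything else is an assumption for illustration, including the function names, the ordering of items in the response vector (modality by modality, construct by construct), and the use of 0/1 item scores.

```python
MODALITIES = ("static", "active", "interactive")
ITEM_COUNTS = {
    "Identifying Principles": 6,
    "Using Principles": 6,
    "Conducting Inquiry": 12,
}


def item_keys():
    """Return the (modality, construct) label of each of the 72 items.

    The ordering is hypothetical; a real data file would carry its own map.
    """
    return [
        (modality, construct)
        for modality in MODALITIES
        for construct, n_items in ITEM_COUNTS.items()
        for _ in range(n_items)
    ]


def composite_scores(responses):
    """Sum one student's item scores into nine modality-by-construct totals."""
    keys = item_keys()
    if len(responses) != len(keys):
        raise ValueError("expected one score per item")
    totals = {}
    for key, score in zip(keys, responses):
        totals[key] = totals.get(key, 0) + score
    return totals


# A hypothetical student who answers every item correctly.
scores = composite_scores([1] * 72)
print(len(scores))
print(scores[("interactive", "Conducting Inquiry")])
```

Because the Conducting Inquiry composites sum twice as many items as the other two, their raw ranges differ, which is worth keeping in mind when comparing unstandardized composites.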

Table 5
Estimated Correlations Among the Three Science Practices for
the Static Mode (Tundra)

Science practice    Identifying    Using    Conducting
Identifying         1              0.92     0.80
Using                              1        0.91
Conducting                                  1


Table 6
Estimated Correlations Among the Three Science Practice
Constructs for the Active Mode (Grasslands)

Science practice    Identifying    Using    Conducting
Identifying         1              0.80     0.80
Using                              1        1
Conducting                                  1


Multidimensional item response theory model. The third
type of analysis used a multidimensional IRT model to evaluate
how well each assessment modality (static, active, or interactive)
was able to measure and separate student performances on the
three science practice constructs. IRT models are probabilistic
models in which item difficulty (a test item’s underlying difficulty
based on the proportion of a given sample that responded correctly)
and person measure (a person’s underlying competence, based on the
proportion of items completed correctly) are simultaneously
estimated. The result is a scale on which both persons and items are
mapped onto the theoretical latent traits, which in this case are the
science practice constructs. The fact that the accuracy and precision
of IRT scores can be quantified makes this a suitable analytic method
in this study to determine how well each of the three modalities of
assessment measures the three science practice constructs.
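
To make the idea of placing persons and items on one scale concrete, the sketch below uses the simplest unidimensional case, a one-parameter logistic (Rasch) model. This is a simplification for illustration only; the study itself fit a multidimensional model, and the ability and difficulty values here are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under a one-parameter logistic model.

    Both arguments are on the same logit scale, which is what lets persons
    and items be mapped onto a common latent trait.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical person (ability 0.5 logits) facing three hypothetical items.
p_matched = rasch_probability(ability=0.5, difficulty=0.5)
p_easy = rasch_probability(ability=0.5, difficulty=-1.0)
p_hard = rasch_probability(ability=0.5, difficulty=2.0)
```

When ability equals difficulty the probability of success is one half; it rises for easier items and falls for harder ones.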

The ACER ConQuest² generalized item response modeling program
was used to run a multidimensional logistic model analysis
that modeled the three science practice constructs separately for 
each of the three modalities of assessment. This allowed the 
correlations among the three science practice constructs to be 
estimated and the reliability of the measurement of each of the 
practice constructs to be quantified for the three modalities of 
assessment. 


Results 


G-study. We first summarize the estimated correlations 
among the three science practice constructs for each of the three 
modalities of assessment (see Tables 5-7). The estimated correlations
in the G-study are corrected for attenuation due to unreliability;
that is, they are estimated correlations among universe scores (true
scores) for the nine constructs.³
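
For readers unfamiliar with correction for attenuation, the classical formula divides an observed correlation by the square root of the product of the two reliabilities. The sketch below shows that classical analogue only; a G-study estimates universe-score correlations from variance and covariance components, and the numbers used here are hypothetical.

```python
import math


def disattenuated_correlation(r_observed, reliability_x, reliability_y):
    """Classical correction for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical observed correlation and reliabilities.
corrected = disattenuated_correlation(0.50, 0.64, 0.81)
```

Because reliabilities are below 1, the corrected value is always at least as large as the observed one, which is why corrected correlations between related constructs can be high.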

These correlation coefficients can be used to examine which 
modality of assessment appears to measure more distinct science 
practice constructs. While we expect there to be a positive corre- 
lation among the three individual science practice constructs 
within the assessment modality (because they are related elements 
of the overall set of science practices, if they are clearly observable 
skills), the correlations should not be too high if the assessment is 
designed to measure multiple distinct constructs. 

¹ http://www.statmodel.com/

² ACER ConQuest: Generalized Item Response Modeling Software,
published and distributed by the Australian Council for Educational
Research.

³ For this reason, the conventional methods to test the difference
between observed correlations would not be appropriate.


Table 7
Estimated Correlations Among the Three Science Practice
Constructs for the Interactive Mode (Mountain Lake)

Science practice    Identifying    Using    Conducting
Identifying         1              0.82     0.72
Using                              1        0.84
Conducting                                  1


The results suggest that while similar patterns (correlations
among science practice constructs) are found across assessment
modes, the results in Table 7 show lower correlations between the
constructs when tested by the interactive modality. The correlation
between identifying and using for the interactive modality is 0.82
versus 0.92 and 0.80 for the static modality and the active
modality, respectively. Using one minus the correlation as a measure
of dissimilarity (it also represents the proportion of variance unique
to each measure), we can see that moving from 0.82 (interactive)
to 0.92 (static) represents a reduction in the variation unique to
each modality from 18% to 8%, a reduction of about 56% (10/18). This
indicates that the correlation between identifying and using is
stronger for the static modality than for the interactive (or active)
modality. Similarly, the correlation between identifying and
conducting and the correlation between using and conducting are
higher for the static and active modalities than for the interactive
modality. Overall, the results suggest that the interactive modality
measured the science practice constructs more distinctly than the
other two modalities.
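
The arithmetic behind the 18%, 8%, and 56% figures can be checked with a few lines. The function name is ours; the dissimilarity measure (one minus the correlation) and the two correlations are the ones given in the text.

```python
def unique_share(correlation):
    """Dissimilarity measure used in the text: one minus the correlation."""
    return 1.0 - correlation


interactive = unique_share(0.82)  # identifying vs. using, interactive mode
static = unique_share(0.92)       # identifying vs. using, static mode
relative_reduction = (interactive - static) / interactive
```

Here `interactive` is about 0.18, `static` is about 0.08, and `relative_reduction` is 10/18, or roughly 56%.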

Table 8 summarizes the estimated variance components and 
G-coefficient by construct and indicates the percentage of variance 
attributable to each effect specified in the design. For example, 
under the static assessment modality for the Identifying Principles 
construct, about 20% of variance is contributed by persons (stu- 
dents), about 16% is from items, and 64% (the majority of vari- 
ance) is accounted for by the person and item interaction. In 
general, this pattern holds for each construct. Note that these 
percentages shown in Table 8 are typical. A large percentage of 
variance for the person by item interaction is unsurprising as these 
percentages refer to single items, and a test with a single item is not 
expected to be reliable. In practice, items are always combined into 
a scale, and the person by item variance is divided by the number 
of items included in the scale (so more items would yield less 
variance for the person by item interaction). 
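
The effect of dividing the person by item variance by the number of items can be sketched as below. The formula is the standard relative G-coefficient for a crossed persons by items design, which we assume matches the design used here; the percentages are the Conducting columns of Table 8 and the 12-item test length is taken from the text.

```python
def g_coefficient(person_pct, interaction_pct, n_items):
    """Relative G-coefficient for a crossed persons x items design.

    Percentages of total variance can be used in place of raw variance
    components because only their ratio matters.
    """
    return person_pct / (person_pct + interaction_pct / n_items)


# (persons %, person x item %) for the Conducting construct, from Table 8.
conducting = {
    "static": (11.97, 68.11),
    "active": (11.59, 72.08),
    "interactive": (18.11, 59.40),
}
g_by_modality = {
    modality: round(g_coefficient(p, pi, n_items=12), 2)
    for modality, (p, pi) in conducting.items()
}
single_item = g_coefficient(11.97, 68.11, n_items=1)
```

With 12 items the coefficients come out at .68, .66, and .79, whereas a single item gives a value below .20, which illustrates why single-item percentages look unreliable.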

The bottom row of Table 8 shows the G-coefficient estimate (a
reliability-like coefficient) for each science practice construct
measured under each of the three assessment modalities. The
G-coefficients for the Identifying and Using constructs are based
on a six-item test within each modality, whereas the coefficients
for Conducting are based on a 12-item test (of multiple component
skills) within each modality. There is little difference between the
G-coefficients for the static, active, and interactive modes in
measuring the Identifying and Using constructs, but there is a
noticeable difference for Conducting: the interactive mode produced
a higher reliability (.79) than the static (.68) or active (.66)
mode. This suggests the interactive modality produced a more reliable
measure for Conducting than did the other two modalities.


Table 8
Estimated Variance Components and G-Coefficient by Construct

                           Identifying                  Using                        Conducting
Effect                     Static  Active  Interactive  Static  Active  Interactive  Static  Active  Interactive
Persons
  Estimated variance       0.04    0.05    0.05         0.04    0.03    0.03         0.03    0.03    0.05
  Percentage of total      19.62   22.27   20.67        16.48   14.83   14.79        11.97   11.59   18.11
Items
  Estimated variance       0.04    0.04    0.04         0.01    0.02    0.03         0.05    0.04    0.06
  Percentage of total      15.93   14.64   15.15        5.96    7.17    12.30        19.92   16.33   22.49
Person × Item interaction
  Estimated variance       0.15    0.15    0.16         0.17    0.17    0.17         0.17    0.17    0.15
  Percentage of total      64.45   63.09   64.18        77.56   78.00   72.91        68.11   72.08   59.40
G-coefficient              0.65    0.68    0.66         0.56    0.58    0.55         0.68    0.66    0.79


Multitrait-multimethod confirmatory factor analysis.
Figure 7 shows the path model for the multitrait-multimethod
CFA. In the diagram, the top three circles represent the construct
factors included in the model for the three science practices
(identify, use, and conduct). The bottom three circles represent the
modality factors for the three assessment modalities (static, active,
and interactive). The nine rectangles represent the sets of items.
The first three on the left represent the items that targeted the
Identifying Principles (identify) science practice, and from left to
right they represent the static, active, and interactive item sets. The
middle three rectangles represent the Using Principles (use) items,
again with static, active, and interactive modalities running from
left to right. The final three rectangles on the right represent the
Conducting Inquiry (conduct) items, arranged yet again with static,
active, and interactive modalities running from left to right. The
straight lines represent the loadings of factors onto items, with
the standardized value shown as the number next to each line.
These factor loadings are summarized in Tables 9 and 10.
The curved arrows represent the correlations between factors,
with the value on each arc.

Our study question can be answered by looking at Tables 9 and
10. The factor loadings of assessment modalities on science practice
items (see Table 10) are generally higher than the factor
loadings of science practice factors (see Table 9). The findings
indicate that the scores are determined more by the assessment
modalities than by the science practice constructs. That is,
consistent with our hypothesis, the modality of assessment does have
an impact on how well items are able to draw out students’
knowledge and skills in the three science practices.

By further looking at the factor loadings in Table 10, we also
find that the factor loadings from the test modality to the three
constructs are slightly lower for the interactive modality (.698 on
average) than for the static (.732 on average) and active modalities
(.719 on average). This is supportive of our hypothesis that the task
items in the interactive modality measure the science practice
constructs more distinctly than the other two modalities. In
particular, looking at the factor loadings in the column for the
Conducting Inquiry construct in Table 10, the loading for the
interactive modality (.673) is considerably lower than for the static
(.702) and active (.739) modalities. This is similar to the finding in
the G-study that the interactive modality better measures the
Conducting Inquiry construct.

Table 11 shows the multitrait-multimethod (MTMM) correlation
matrix based on CFA (the factor loadings and intercorrelations
among factors are shown in Figure 7). It provides information
about how each science practice is tied to each assessment modality.
The figures that are in bold represent the within-modality,
cross-science practice correlations. These values are similar to
what is revealed in Table 10.

Multidimensional IRT. Table 12 shows a type of reliability
coefficient, the expected a posteriori/plausible values (EAP/PV)
reliability, derived from the MIRT analysis. The EAP/PV
reliability coefficient represents how well the persons (students)
are separated by the measures of each of the science practice
constructs. As seen in the G-study and the MTMM/CFA, the
differences among the assessment modalities are relatively small
for the Identifying and Using Principles constructs, but the
interactive modality has a higher reliability coefficient (.82) for
the Conducting construct. Again, this indicates that the interactive
assessment modality was more reliable in measuring the
Conducting science construct than the other two assessment
modalities.


Table 9
Factor Loadings of Science Practice Factors on the Sets of
Items Represented by Three Assessment Modalities

                            Items by assessment modality
Science practice factors    Static    Active    Interactive
Identifying                 0.372     0.578     0.514
Using                       0.000     0.081     0.095
Conducting                  0.365     0.369     0.470


Figure 7. Path model for the multitrait-multimethod confirmatory factor analysis.


Discussion and Implications 


With the increasing interest in the use of technology to create
assessments that measure skills that are hard to assess in traditional
static modalities, this study suggests that engaging students in
interactive assessments may provide a better estimate of their more
complex inquiry practices than active or static formats do. Such
interactive modalities are currently not widely used in science
assessment, if at all. The study is grounded in research-based
principles and literature sources for informing the design of next
generation assessments and an empirical study of the affordances
of dynamic and interactive modalities for measuring distinct
science practice constructs. These outcomes of the Foundations of
21st Century Science Assessments project provide guidelines for
designing the next generation of science assessments and evidence
supporting claims that the affordances of dynamic, interactive,
complex assessment tasks can improve the measurement of
science inquiry practices.


Table 10
Factor Loadings of Assessment Modality Factors on the Sets of
Items Represented by Three Science Practices

                               Items by science practices
Assessment modality factors    Identifying    Using    Conducting
Static                         0.702          0.793    0.702
Active                         0.654          0.766    0.739
Interactive                    0.664          0.758    0.673


Table 11
Multitrait-Multimethod Correlation Matrix Based on CFA

                               Identifying                  Using                        Conducting
Science practice  Modality     Static  Active  Interactive  Static  Active  Interactive  Static  Active  Interactive
Identifying       Static       1.00
                  Active       0.62    1.00
                  Interactive  0.60    0.69    1.00
Using             Static       0.56    0.00    0.00         1.00
                  Active       -0.01   0.49    -0.01        0.54    1.00
                  Interactive  [?]     [?]     0.49         0.53    0.53    1.00
Conducting        Static       0.53    0.06    0.05         0.56    0.02    0.03         1.00
                  Active       0.04    0.54    0.05         0.00    0.59    0.03         0.60    1.00
                  Interactive  0.05    0.07    0.51         0.00    0.03    0.55         0.59    0.62    1.00

Note. The figures that are in bold represent the within-modality, cross-science practice correlations. CFA = confirmatory factor analysis.

We note that the design principles presented in this article relate 
to the types of summative assessment in the three modalities 
compared in this study. Literature on the affordances of dynamic, 
interactive modalities for formative and adaptive purposes was 
synthesized by the project and will be reported elsewhere. Rela- 
tively little research has studied the interaction of multiple media 
such as text, graphics, and static and dynamic perceptual cuing in 
complex tasks. There is considerable research to be done on the 
functions of multiple representations and interactive interfaces in 
learning and assessments of science systems and practices (Buck- 
ley & Quellmalz, 2013). 

This study provides rare large-scale evidence that interactive
assessments may be more effective than static assessments at
discriminating student proficiencies across different types of
science practices. Studies comparing item formats have primarily
been within the static modality (selected vs. constructed responses)
or between performance assessments and conventional tests. This
study extends the comparison of task and item design to complex
tasks involving inquiry practice constructs and the dynamic and
interactive affordances of technology-based complex science
assessment tasks. The three modality versions compared (static,
active, and interactive) were carefully constructed to keep the
representations of the science ecosystem parallel. Thus, all three
versions depicted the ecosystems with parallel stylistic images of
the organisms, tables, graphs, and screen layouts. In contrast,
typical ecosystem items in extant tests tend to vary in the
representation of the ecosystem, for example, by presenting a food web
as a set of boxes, text organism names, or pictures of organisms.
This study aimed for comparable representations of the ecosystems
so that the research variables would be the extent of learner control
(static, active, interactive) and the dynamic level of the ecosystem
presentation, that is, a still image, an animation, or a dynamic,
changing display. Research on design variations within next
generation assessments will face similar methodological challenges.


Table 12
EAP/PV Reliability Coefficients for Static, Active, and
Interactive Modalities

Modality       Identifying    Using    Conducting
Static         [?]            [?]      [?]
Active         .74            [?]      [?]
Interactive    [?]            [?]      .82

Note. EAP/PV = expected a posteriori/plausible values.

The study of alternative complex task and item formats also
presents analysis challenges. In this study, a combination of
methods (a generalizability study, MIRT, and confirmatory factor
analyses) examined the measurement properties of the modalities
through different lenses. Examining the convergent, discriminant,
and construct validity of complex, dynamic assessments poses
challenges for the measurement community.


Conclusions 


This project integrated research on learning in rich multimedia 
environments with evidence-centered assessment design methods to 
shape a framework for developing and establishing the technical 
quality of reusable task designs for assessing complex science learn- 
ing. We believe this will make a significant contribution to the field by 
moving the state of technology-based item development to more 
principled practice through identifying relevant findings from re- 
search on model-based reasoning and multimedia learning that affect 
the design of assessments of learning, retrieval, and transfer. 

Current science tests do not address some of the valued knowledge 
and practices called for in the new Framework for K-12 Science 
Education and draft Next Generation Science Standards. Therefore, 
the next generation of science assessments will need to address both 
a broader range of standards and innovative methods for assessing 
them. 

The study provides much needed empirical evidence of the affor- 
dances of dynamic and interactive assessments for discriminating 
among science knowledge and inquiry skills. The results suggest that 
static assessments are not as effective as interactive assessments for 
differentiating between factual knowledge and the ability to apply that 
knowledge in meaningful contexts. Our study found that the interac- 
tive task sets that served as a basis of the interactive assessments were 
more effective than either static or active assessments at uniquely 
measuring students’ ability to engage in inquiry practices. Therefore, 
assessment developers who wish to design assessments of science 
inquiry skills should consider the use of active and interactive assess- 
ment tasks. 


My Science Tutor (MyST) is an intelligent tutoring system designed to improve science learning by
elementary school students through conversational dialogs with a virtual science tutor in an interactive
multimedia environment. Marni, a lifelike 3-D character, engages individual students in spoken dialogs
following classroom investigations using the kit-based Full Option Science System program. MyST
attempts to elicit self-expression from students; process their spoken explanations to assess
understanding; and scaffold learning by asking open-ended questions accompanied by illustrations,
animations, or interactive simulations related to the science concepts being learned. MyST uses automatic
speech recognition, natural language processing, and dialog-modeling technologies to interpret student
responses and manage the dialog. Sixteen 20-min tutorials were developed for each of 4 areas of science
taught in 3rd, 4th, and 5th grades. During summative evaluation of the program, students received
one-on-one tutoring via MyST or an expert human tutor following classroom instruction on the science topic,
representing over 4.5 hr of tutoring across the 16 sessions. A quasi-experimental design was used to
compare average learning gain for 3 groups: human tutoring, virtual tutoring, and no tutoring. Learning
gain was measured using standardized assessments given to students in each condition before and after
each science module. Results showed that students in both the human and virtual tutoring groups had
significant learning gains relative to students in the control classrooms and that there were no significant
differences in learning gains between students in the human and MyST tutoring conditions. Both
teachers and students gave high-positive survey ratings to MyST.


Keywords: intelligent tutors, spoken dialog, science learning 


According to the 2009 National Assessment of Educational
Progress (NAEP, 2005), only 34% of fourth graders, 30% of
eighth graders, and 21% of 12th graders tested as proficient in
science, with 1% to 2% of these students demonstrating advanced
knowledge of science in these grades. Thus, over two thirds of
U.S. students are not proficient in science. The vast majority of
these students are in low-performing schools that include a high
percentage of disadvantaged students from families with low
socioeconomic status, which often include English learners
with low English-language proficiency. Analysis of the NAEP
scores in reading, math, and science over the past 20 years
indicates that this situation is getting worse. For example, the
gap between English learners and English-only students, with
English learners scoring over one standard deviation lower, has
increased rather than decreased over the past 20 years. Moreover,
science instruction is often underemphasized in U.S.
schools, with reading and math being stressed. My Science 
Tutor (MyST) was designed to address this problem by immers- 
ing students in a multimedia environment with a virtual science 
tutor that was designed to behave like an engaging and effective 
human tutor. The focus of the program is to improve each 
student’s engagement, motivation, and learning by helping 
them learn to visualize, reason about, and explain science 
during conversations with the virtual tutor. 

The learning principles embedded in MyST are consistent with
conclusions and recommendations of the National Research Council
report, “Taking Science to School: Learning and Teaching
Science in Grades K-8” (Duschl, Schweingruber, & Shouse,
2007), which emphasizes the critical importance of scientific
discourse in K-12 science education. The report identifies the
following crucial principles of scientific proficiency:


Students who are proficient in science: 1. know, use, and interpret 
scientific explanations of the natural world; 2. generate and evaluate 
scientific evidence and explanations; 3. understand the nature and 
development of scientific knowledge; and 4. participate productively 
in scientific practices and discourse. (p. 2) 


The report also emphasizes that scientific inquiry and discourse are
learned skills, so students need to be involved in activities in
which they learn appropriate norms and language for productive
participation in scientific discourse and argumentation.

In a meta-analysis of 18 studies, Chi (2009) examined student
learning along the continuum of active, constructive, and
interactive tasks. Active tasks include “doing something,” such as
participating in a classroom science investigation. Constructive 
tasks include “producing something,” such as a written report 
describing the results of the investigation. Interactive tasks require 
discourse and argumentation with a peer or tutor. Chi’s analysis of 
the research studies produced strong evidence that interactive tasks 
produce the greatest learning gains. 

A substantial body of research indicates that engaging in dis- 
course and argumentation about science is one of the most chal- 
lenging tasks for young learners, and one of the most important 
and beneficial skills for them to acquire (Hake, 1998; Murphy, 
Wilkinson, Soter, Hennessey, & Alexander, 2009; Osborne, 2010; 
Soter et al., 2008). However, evidence also indicates that authentic 
conversations are extremely rare across all content areas in U.S. 
classrooms (Cazden, 1988; Gamoran & Nystrand, 1991; Nystrand, 
1997). As Osborne (2010) noted, “Argument and debate are com- 
mon in science, yet they are virtually absent in science education” 
(p. 463). Our goal in designing MyST was to provide students with 
the scaffolding, modeling, and practice they need to learn to reason 
and talk about science. 

MyST is an intelligent tutoring system intended to provide an 
intervention for third-, fourth-, and fifth-grade children who are 
struggling with science. In our study, it was used as a supplement 
to normal classroom instruction using the Full Option Science 
System (FOSS). FOSS is an inquiry-based science program that is 
based on the idea that “The best way for students to appreciate the 
scientific enterprise, learn important scientific concepts, and de- 
velop the ability to think well is to actively construct ideas through 
their own inquiries, investigations and analyses” (FOSS, n.d., para. 
3). It has been under development since 1988, and is in use in 
every state in the United States. Twenty-six science modules have 
been developed for Grades K-6. The learning objectives in each
FOSS module are aligned to the National Science Education 
Standards and standards for most states. Each module covers an 
integrated area of science (e.g., Mixtures and Solutions, Measure- 
ment, Variables). The instructional materials for each module are 
packaged in a kit that contains the materials needed to conduct the 
classroom science investigations: a teacher guide, a module- 
specific teacher-preparation video, and a summative assessment 
(Assessing Science Knowledge [ASK]) to be administered before 
and after each science module. 

Within a science module, students in classrooms work in small
groups to conduct a series of approximately 16 science investigations
over an 8- to 10-week period. These hands-on investigations
are aligned to specific science concepts and learning objectives.
The structure of the FOSS program provides an ideal test bed for 
research and evaluation of MyST, with MyST dialogs being 
aligned with specific classroom science investigations, learning 
objectives, science standards, and ASK assessments. 


Research Motivating the Design of MyST Dialogs 


MyST is an example of a new generation of intelligent tutoring 
systems that facilitate learning through natural spoken dialogs with 
a virtual tutor in multimedia activities. Intelligent tutoring systems
aim to enhance learning achievement by providing students with 
individualized and adaptive instruction similar to that provided by 
a knowledgeable human tutor. These systems support typed or 
spoken input, with the system presenting prompts and feedback via 
text, a human voice, or an animated pedagogical agent (Graesser, 
VanLehn, Rosé, Jordan, & Harter, 2001; Lester et al., 1997; 
Mostow & Aist, 2001; VanLehn et al., 2007; Wise et al., 2005). 
Text, illustrations, and animations may be incorporated into the 
dialogs. Research studies show up to one sigma gains (approxi- 
mately equivalent to an improvement of one letter grade) when 
comparing performance of high school and college students who 
use the tutoring systems with students who receive classroom 
instruction on the same content (Graesser et al., 2001; VanLehn & 
Graesser, 2001; VanLehn et al., 2005). In a recent synthesis of 
research that compared learning gains following human tutoring or 
following use of an intelligent tutoring system, VanLehn (2011) 
concluded that human tutoring and intelligent tutoring systems 
produce approximately the same effect size, with human tutoring 
at d = 0.79 and intelligent tutoring systems at d = 0.76. 
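
The effect sizes quoted above are standardized mean differences. As a rough illustration of how such a value is obtained, the sketch below computes Cohen's d with a pooled standard deviation; the two score lists are invented for the example and are not data from any of the cited studies.

```python
import math
import statistics


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_variance = (
        (n_t - 1) * statistics.variance(treatment)
        + (n_c - 1) * statistics.variance(control)
    ) / (n_t + n_c - 2)
    return (statistics.mean(treatment) - statistics.mean(control)) / math.sqrt(pooled_variance)


# Hypothetical posttest scores for a tutored and an untutored group.
tutored = [14, 16, 18, 20, 22]
untutored = [12, 14, 16, 18, 20]
d = cohens_d(tutored, untutored)
```

A value of d near 0.8, as reported for both human tutoring and intelligent tutoring systems, means the group means differ by about four fifths of a standard deviation.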

The development of MyST is informed by several decades of 
research in psychology and computer science. In the remainder of 
this section, we briefly describe theory and research that informed 
the design of MyST. 


Benefits of Tutorial Instruction 


Theory and research provide strong guidelines for designing 
effective tutoring dialogs. Over two decades of research have 
demonstrated that learning is most effective when students receive 
individualized instruction in small groups or one-on-one tutoring. 
Bloom (1984) determined that the difference between the amount 
and quality of learning for students who received classroom in- 
struction and those who received either one-on-one or small-group 
tutoring was two standard deviations. Evidence that tutoring works 
has been obtained from dozens of well-designed research studies, 
meta-analyses of research studies (Cohen, Kulik, & Kulik, 1982), 
and positive outcomes obtained in large-scale tutoring programs 
(Madden & Slavin, 1989; Topping & Whiteley, 1990). 
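
One way to read a difference expressed in standard deviations is to convert it to a percentile. The sketch below does this under the assumption that scores in the comparison group are normally distributed, which is our assumption for illustration and not a claim made in the studies cited.

```python
from statistics import NormalDist


def percentile_of_average_tutored_student(effect_size_sd):
    """Percentile rank, within the comparison group, of the average treated student.

    Assumes normally distributed scores with equal spread in both groups.
    """
    return 100.0 * NormalDist().cdf(effect_size_sd)


two_sigma = percentile_of_average_tutored_student(2.0)
no_effect = percentile_of_average_tutored_student(0.0)
```

Under that assumption, a two standard deviation advantage places the average tutored student near the 98th percentile of the classroom-instruction group.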

Benefits of tutoring can be attributed to several factors, includ- 
ing the following: 

Question generation. A significant body of research shows 
that learning improves when teachers and students ask deep-level- 
reasoning questions (Bloom, 1956). Asking authentic questions 
leads to improved comprehension, learning, and retention of texts 
and lectures by college students (Craig, Gholson, Ventura, & 
Graesser, 2000; Driscoll et al., 2003; King, 1989) and school 
children (King, 1994; King, Staffieri, & Adelgais, 1998; Palincsar
& Brown, 1984).

Generating explanations. Research has demonstrated that
having students produce explanations improves learning (Chi et
al., 1989; Chi, Siler, Jeong, Yamauchi, & Hausmann, 2001; King,
1994; King et al., 1998; Palincsar & Brown, 1984). In a series of
studies, Chi et al. (1989, 2001) found that having college students 
generate self-explanations of their understanding of physics prob- 
lems improved learning. Self-explanation also improved learning 
about the circulatory system by eighth-grade students in a con- 
trolled experiment (Chi, De Leeuw, Chiu, & LaVancher, 1994). 
Hausmann and VanLehn (2007a, 2007b) note that “self-explaining
has consistently been shown to be effective in producing
robust learning gains in the laboratory and in the classroom”
(2007b, p. 1067). Experiments by Hausmann and VanLehn
(2007b) indicate that it is the process of actively producing
explanations, rather than the accuracy of the explanations, that
makes the biggest contribution to learning.

Knowledge coconstruction. Students coconstruct knowledge 
when they are provided with the opportunity to express their ideas 
and to evaluate their thoughts in terms of ideas presented by others. 
There is compelling evidence that engaging students in meaningful 
conversations improves learning (Butcher, 2006; Chi et al., 1989; 
King, 1994; King et al., 1998; Murphy et al., 2009; Palincsar &
Brown, 1984; Pine & Messer, 2000; Soter et al., 2008). 


Social Constructivism 


In social constructivism, learning is viewed as an active social 
process of constructing knowledge “that occurs through processes 
of interaction, negotiation, and collaboration” (Palincsar, 1998, p. 
365). Vygotsky (1978) stressed the critical role of social interac- 
tion within one’s culture in acquiring the social and linguistic tools 
that are the basis of knowledge acquisition. “Learning awakens a 
variety of internal developmental processes that are able to operate 
only when the child is interacting with people in his environment” 
(Vygotsky, 1978, pp. 89-90). He stressed the importance of hav- 
ing students learn by presenting problems that enable them to 
scaffold existing knowledge to acquire new knowledge. Vygotsky 
introduced the concept of the zone of proximal development, “the 
distance between the actual developmental level as determined by 
independent problem solving and the level of potential develop- 
ment as determined through problem solving under adult guidance 
or in collaboration with more capable peers” (Vygotsky, 1978, p. 
86). Social constructivism provides the conceptual model for 
knowledge acquisition in MyST: to improve learning by scaffold- 
ing conversations using open-ended questions and media to sup- 
port hypothesis generation and coconstruction of knowledge. 


Discourse Comprehension Theory 


Cognitive learning theorists generally agree that learning occurs 
most effectively when students are actively engaged in critical 
thinking and reasoning processes that cause new information to be 
integrated with prior knowledge. Discourse comprehension theory 
(Kintsch, 1988, 1998) holds that deep learning requires integration 
of prior knowledge with new information and results in the ability 
to use this information constructively in new contexts. To the 
extent possible, MyST attempts to determine relevant information
that students know and to build on it to lead students to correct
explanations.




Social Agency and Pedagogical Agents 


When human-computer interfaces are consistent with the social
conventions that guide our daily interactions with other people, 
they provide more engaging, satisfying, and effective user expe- 
riences (Nass & Brave, 2005; Reeves & Nass, 1996). Such pro- 
grams foster social agency, enabling users to interact with them the 
way they interact with people. In comparisons of programs with 
and without talking heads or human voices, children learned more 
and reported more satisfaction using programs that incorporated 
virtual humans (Atkinson, 2002; Baylor & Kim, 2005; Moreno, 
Mayer, Spires, & Lester, 2001). A number of researchers have 
observed that children become highly engaged with virtual tutors 
and appear to interact with a virtual tutor as if it were a real teacher 
and appear motivated to work hard to please it. Lester et al.
(1997) termed this phenomenon the “persona effect.”


Multimedia Learning 


During MyST dialogs, students are encouraged to construct 
explanations of science presented in illustrations, silent anima- 
tions, and interactive simulations. The design of these dialogs is 
consistent with research indicating that combining spoken expla- 
nations with media can optimize science learning, either during 
multimedia presentations (Horz & Schnotz, 2010; Mayer, 2001, 
2005) or when students are required to generate explanations in 
multimedia learning environments (Roy & Chi, in press). In a 
series of studies, Mayer (2001) investigated students’ ability to 
learn how things work (motors, brakes, pumps, lightning) when 
information was presented in different modalities (e.g., text with 
illustrations, or narration of the text during which a spoken voice 
explained the information presented in an illustration or sequence 
of illustrations). A key finding of Mayer’s work is that simultane- 
ously presenting speech (narration) with nonverbal visual infor- 
mation (a sequence of illustrations or an animation) results in the 
highest retention of information and the application of knowledge 
to new problems. Mayer (2001) argued that when a person is 
presented with a well-designed narrated animation, the listener is 
able to construct an enriched multimodal representation of the two 
sources of input, leading to superior recall and transfer of knowl- 
edge to new tasks. Roy and Chi (in press), based on a review of the 
literature on self-explanations in multimedia environments, sug- 
gest that 


many learners would benefit from self-explanation training or prompt- 
ing within multimedia environments. Essentially, we have argued that 
because they are information rich, multimedia environments afford 
the generation of many opportunities for explaining encoded infor- 
mation and accessing and relating prior knowledge. (p. 27) 


Dialog Interaction 


The design of spoken dialogs in MyST is based on a number of 
principles used in Questioning the Author (QtA), an approach to 
classroom discussions developed by Isabel Beck and Margaret 
McKeown (Beck, McKeown, Sandora, Kucan, & Worthy, 1996; 
McKeown & Beck, 1999; McKeown, Beck, Hamilton, & Kucan, 
1999). During the 3-year period in which MyST dialogs were 
designed, tested, and refined, we worked with QtA codeveloper 
Margaret McKeown to apply principles of QtA to spoken dialogs 
with Marni that incorporate illustrations, animations, and interactive
simulations to help students visualize the science they are
trying to explain. 

QtA is a mature, scientifically based, and effective program 
used by hundreds of teachers across the United States. It is de- 
signed to improve comprehension of narrative or expository texts 
that are discussed as they are read aloud in the classroom. The 
focus is to have students grapple with, and reflect on, what an 
author is trying to say in order to build a representation from the 
text. The approach uses open-ended questions to initiate discussion 
(What is the author trying to say?), to help students focus on the
author’s message (That’s what she says, but what does she mean?),
to help students link information (How does that fit with what the
author already told us?), and to help the teacher guide students
toward comprehension of the text. 

QtA provides a good basis for tutorial interaction in the MyST 
virtual tutoring system because (a) research shows that it is effec- 
tive for improving comprehension (Murphy & Edwards, 2005); (b) 
it provides a framework and planning process that helps define 
learning goals and develops an orderly sequence for getting stu- 
dents to achieve the goals; (c) it offers ways to design prompts that 
draw student attention to relevant portions of presented material, 
but that are open enough to leave the identification of the material 
to students; (d) it provides a principled, easily understandable and 
well-documented program for teachers or tutors to elicit and re- 
spond to student responses that helps them learn to focus on and 
make connections between meaningful elements of the discourse 
and their own experiences; and (e) it focuses on comprehension, 
with discussion of student personal views and experiences limited 
to those that can directly enhance building meaning from texts, 
lectures, multimedia presentations, data sets, or hands-on learning 
activities. 

Murphy and Edwards (2005) analyzed the results of research 
studies that met rigorous scientific criteria for evaluating programs 
designed to improve student learning through classroom conver- 
sations. Of the nine programs that met the scientific criteria for 
valid research studies, QtA was identified as one of two ap- 
proaches that is likely to promote high-level thinking and compre- 
hension of text (Murphy & Edwards, 2005). Moreover, analysis of 
the QtA discourse showed a relatively high incidence of authentic 
questions, uptake, and teacher questions that promoted high-level 
thinking—all indicators of productive discussions likely to pro- 
mote learning and comprehension of text (Soter & Rudge, 2005). 


The MyST System 


System Description 


Students learn science in MyST through natural spoken dialogs 
with the virtual tutor Marni, a 3-D computer character that is on 
screen at all times. Marni asks students open-ended questions 
related to illustrations, silent animations, or interactive simulations 
displayed on the computer screen. Figure 1 displays a screen shot
of Marni asking questions about media displayed in a tutorial. The 
student’s computer shows a full screen window that contains 
Marni, a display area for presenting media, and a display button 
that indicates the listening status of the system. Marni produces 
accurate visual speech, with head and face movements that are 
synchronized with her speech. 

Figure 1. My Science Tutor (MyST) screen layout.


We call these conversations with Marni multimedia dialogs, 
because students simultaneously listen to and think about Marni’s 
questions while viewing illustrations and animations or interacting 
with a simulation. The media facilitate dialogs with Marni by 
helping students visualize the science they are discussing. The 
primary focus of each dialog is to elicit self-explanations from 
students. MyST analyzes the spoken explanations to determine 
what the student does and does not know about the science, then 
presents follow-up questions, which may be accompanied by new 
media, to help the student construct a correct explanation of the 
phenomena being studied. The virtual tutor Marni, who speaks 
with a recorded human voice, is designed to behave like an 
effective human tutor that the student can relate to and work with 
to learn science. This is achieved by modeling dialogs between 
students and human tutors trained in using QtA during the devel- 
opment phase of the project. These dialogs scaffold learning by 
providing students with support when needed until they can apply 
new skills and knowledge independently (Vygotsky, 1978). 

Marni elicits self-explanations from students using strategies 
that embody QtA dialog moves such as marking and revoicing. 
These two techniques require that the system identify the student’s 
dialog content (marking it) followed by repeating (revoicing) a 
paraphrase of the information back to the student as a part of the 
next question: You mentioned that electricity flows in a closed 
path. What else can you tell me about how electricity flows? 
Marni’s responses are designed to communicate this understanding 
back to the students and to engage and assure them that she 
understands what they are saying. 

A tutorial session generally begins with relating the session to 
what the student has recently covered in class (during a science 
investigation), with Marni saying something like: What have you 
been studying in science recently? If the student says something 
recognizable as the tutorial topic (e.g., “We made a circuit”), the 
system moves forward by asking the student what they know about 
the topic: You mentioned circuits. Can you tell me what a circuit 
is? If nothing from what the system extracted from the student’s 
answer relates to the topic, then Marni introduces the topic: I heard
you were learning about circuits. Can you tell me what a circuit 
is? For each key concept discussed, the interaction typically begins 
with a general open-ended question (accompanied by media, such 
as a picture of a simple circuit): What’s this all about? or What’s 
going on here? and then proceeds to more directed open-ended 
questions like: Can you tell me more about the flow of electricity
in the circuit? 

Media are used to ground the conversation, focus the student’s 
attention, help the student visualize the science, and provide a 
visual frame of reference for the student to talk about. The media 
are not narrated, and they do not explain the concept to the student. 
A typical strategy used by MyST is to show an animation to the 
student and ask him or her to explain what is going on. The use of 
media was initially intended as a mechanism to get students past 
sticking points, points in a dialog when the system is not able to 
elicit information from the student that it can build on. During 
dialogs with project tutors during system development, discussed 
below, the method proved so useful for eliciting explanations that 
tutors began to use this as the standard introduction to concepts: 
ask an introductory question about what a student knows, show an 
illustration, and ask what is going on. 

As noted, MyST dialogs incorporate three types of media: (a) 
illustrations, (b) animations, and (c) interactive simulations, illus- 
trated in Figure 2. Although these sometimes overlap in the content 
presented, each plays a unique role. Illustrations are static Flash 
drawings and are a good way to initiate discussions about topics. 
They provide the student with a visual frame of reference that 
helps focus the student’s attention and the subsequent discussion 
on the content of the illustration: So, what’s going on here? 
Animations are noninteractive, silent Flash animations that help 
students visualize concepts that can be difficult to capture in 
illustrations. In Figure 2, the direction of the flow of electricity is 
represented by blue dots moving from the D-cell through the wires 
and bulb and back to the D-cell. The animations enable Marni to 
ask the student questions to elicit explanations about what is being 
shown. Simulations allow students to interact directly with the 
Flash animation using a mouse. Figure 2 shows a simulation of a 
FOSS classroom investigation called “Breaking the Force” in 
which students investigate how much weight (number of metal 
washers) is required on one side of a balance scale to break the 
force of the magnets attracting each other on the other side. The 
number of washers in the cup and the space between magnets 
can be investigated and graphed in this simulation. During 
multimedia dialogs, as students are interacting with a simulation,
the tutor can say things like: What could you do to ...?
What happens if you ...?


System Operation (How Spoken Dialogs Work) 


MyST uses character animation, automatic speech recognition,
natural language processing, and dialog modeling to support
conversations with Marni. The dialogs are designed to elicit responses
from students that show their understanding of a specific set of 
points. The key points of a dialog are specified as propositions 
realized as semantic frames. The frames represent the events and 
entities in the domain and the roles that they play. For example, 
Current goes from the negative terminal to the positive would be
represented as: Electricity Flows Origin.negative
Destination.positive. During spoken dialogs, the tutor asks questions that are
designed to elicit student responses that will map to the elements 
of the targeted semantic frames. Information extracted from stu- 
dent responses is integrated into the session context that represents 
which points have been addressed by the student, which have not, 
which were expressed correctly, and which represented miscon- 
ceptions. In analyzing a student’s answer, the system tests whether 
the correct values are filling the semantic roles (i.e., whether the 
value of Origin is negative or positive). On the basis of the current 
context, the system generates questions to elicit explanations of the 
elements needed to produce a complete explanation. Follow-up 
questions and media presentations are designed to scaffold learn- 
ing by providing hints about the important elements of the inves- 
tigation that the student did not include or misunderstood. When 
possible, the follow-up questions are created by taking a relevant 
part of the student’s response and asking for elaboration, explana- 
tion, or connections to other ideas. 
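
The role-checking step described above can be illustrated with a short Python sketch. It is not the MyST implementation: the `Frame` class, the `check_roles` function, and the three status labels are hypothetical names chosen only to mirror the Electricity Flows example, and the student parse is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A proposition: a frame name plus a mapping of role -> value."""
    name: str
    roles: dict = field(default_factory=dict)


def check_roles(target: Frame, extracted: Frame) -> dict:
    """Label each target role as 'correct', 'misconception', or 'missing'."""
    status = {}
    for role, expected in target.roles.items():
        value = extracted.roles.get(role)
        if value is None:
            status[role] = "missing"        # student has not addressed this point
        elif value == expected:
            status[role] = "correct"        # role filled with the expected value
        else:
            status[role] = "misconception"  # role filled with a wrong value
    return status


# Target proposition from the text: Origin.negative, Destination.positive
target = Frame("ElectricityFlows", {"Origin": "negative", "Destination": "positive"})

# Hypothetical parse of a student who says the flow starts at the positive end
student = Frame("ElectricityFlows", {"Origin": "positive"})

status = check_roles(target, student)
print(status)
```

Under these assumptions, a session context could use the resulting labels to decide which point to ask about next: roles marked missing call for an eliciting question, and roles marked as misconceptions call for a follow-up that revisits the point.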

This interaction style is well suited to automatic speech recog- 
nition (ASR) technology, which will have some amount of recog- 
nition error. In sessions in which the system is able to accurately 
recognize and parse student responses, it is able to adapt the 
tutorial to the individual student. It may move on to another point 
or delve more deeply into a discussion of concepts that were not 
correctly expressed by the student, using marking and revoicing to 
incorporate information from the student’s response. If the student 
does not seem to grasp the basic elements under discussion, the 
system presents more background material. If the system is unable 
to elicit and understand relevant student responses, by default it 
proceeds through the session with a full discussion of each point. 

Using spoken responses in this way can increase efficiency and 
naturalness of the interaction while minimizing the impact of 
system errors. False-negative errors, in which the system does not 
recognize correct information provided by the student, simply 
cause the system to continue to talk about the same point in a 
different way rather than moving on. False-accept errors, where 
the system fills in an element because of a recognition error, may 
cause the system to move on from a point before it is sufficiently
covered. False-accept errors are rare and have not proved to be a
problem.

Figure 2. Media types: illustration, animation, and simulation.


System Development 


During the development and evaluation of MyST, data were 
collected from tutoring sessions at elementary schools in the 
Boulder Valley School District (BVSD). A team of project tutors 
was trained in the FOSS content and QtA-based interaction style. 
Using FOSS teacher guides, the team developed learning objec- 
tives and specifications for media presentations aligned to each 
classroom science investigation. Tutors went into the schools and 
tutored students using the materials developed. Visuals were pre- 
sented on laptops, and students wore headsets for recording their 
speech. The recorded sessions were reviewed in group meetings to 
revise the presentations and determine sticking points that would 
benefit from the introduction of media. These meetings also helped 
foster a common style across tutors. In addition, transcripts of 
tutoring sessions were reviewed and annotated by M. McKeown to 
provide constructive feedback to the project tutors on how to use 
QtA principles most effectively. The data collected in the human- 
tutored sessions were used to train the speech recognition and 
natural language-processing modules to interpret the students’ 
speech and to develop dialog models to attempt to emulate the 
behavior of the human tutors. These modules were integrated to 
produce the first version of MyST that was used in Wizard-of-Oz
(WOZ) studies. 


WOZ 


WOZ data collection attempts to provide user interactions sim- 
ilar to the target application, but a human controls the system 
behavior. In the WOZ collection, students independently inter- 
acted with Marni, while a remote human tutor, connected to the 
student’s computer via the Internet, monitored and controlled 
the system’s behavior. The human wizard could see everything on 
the student’s computer and hear what the student was saying. At 
each point in a dialog when the system was about to take an action 
(e.g., have Marni talk; present a new illustration), the action was 
first shown to the human wizard who could accept or change the 
action. The system logged all transactions during the session. 
Transcriptions of the dialogs in each session were then reviewed 
by developers to refine the dialog model. The primary changes 
during this phase of development included adding new media, 
expanding the coverage of the natural language processing (to 
accommodate new ways students could talk about concepts), and 
adding new ways of asking students questions. As the tutorials 
evolved, human wizards intervened less. 
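
The accept-or-change step in the WOZ procedure can be sketched as follows. This is an illustrative Python sketch only: `wizard_step`, the log fields, and the action strings are hypothetical and do not describe the actual MyST software.

```python
def wizard_step(proposed_action, wizard_decision=None, log=None):
    """Return the action to execute.

    The system's proposal is used unless the wizard supplies a replacement.
    Every transaction is appended to the log, if one is given.
    """
    final = proposed_action if wizard_decision is None else wizard_decision
    if log is not None:
        log.append({
            "proposed": proposed_action,
            "executed": final,
            "overridden": final != proposed_action,
        })
    return final


log = []
# Wizard accepts the first proposal and changes the second one.
wizard_step("ask: What's going on here?", None, log)
wizard_step("show: circuit animation",
            "ask: Can you tell me more about the flow of electricity?", log)

# Share of system proposals that the wizard changed in this session.
intervention_rate = sum(entry["overridden"] for entry in log) / len(log)
print(intervention_rate)
```

A log of this kind would make it possible to track how often wizards intervened as the dialog model was refined.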

In sum, during initial development of tutorial dialogs with 
human tutors, a total of 189 students received human tutoring over 
a total of 427 sessions. During the subsequent WOZ sessions, a 
total of 347 students received WOZ tutoring over 1,156 sessions. 
The purpose of data collected during development was to improve 
system coverage, that is, to model the different ways that diverse
students talked about science, and to refine the media presentations,
so the emphasis was on including a greater variety of students, 
with less data from each individual student than in the system 
evaluation. 


System Evaluation


All data collected in the human-tutoring and WOZ sessions 
were used to train the final acoustic, language, and dialog models 
for the virtual tutoring system. During the 2010-2011 school year,
an assessment of the MyST system was conducted to examine the 
effect of the virtual tutor on student test scores in science. During 
the assessment, students interacted with Marni independently in 
their schools, without a human wizard. An experimenter logged 
students into the MyST system and specified the dialog session to 
be used, but otherwise left students alone to use the system. The 
experimental design compared students receiving MyST tutoring 
with those receiving face-to-face human tutoring in small groups. 

Students were randomly assigned within classrooms to tutoring 
condition, and these groups were also compared with students 
from intact control classrooms with no tutoring. Students completed
one of four FOSS modules (Variables, Magnetism and Electricity,
Measurement, and Water) and were tested pre-post
with the FOSS-ASK assessment for that module. All students 
received similar classroom instruction. The two hypotheses for the 
study were as follows: 


Hypothesis 1: Students receiving tutoring with MyST will 
show learning gains roughly similar to students receiving 
face-to-face human tutoring. 


Hypothesis 2: Both groups receiving tutoring will show 
greater learning gains than students receiving no tutoring. 


Method 


Participants 


Data were collected from tutoring sessions at elementary 
schools in the BVSD. BVSD is a 27,000-student school district 
with 34 elementary schools. There is substantial student diver- 
sity across schools, which vary from low to high performing on 
state science tests. A list of potential schools was developed in 
collaboration with the BVSD science director. All third-, 
fourth-, and fifth-grade teachers at these schools were invited to 
participate in the study, and teachers who accepted were en- 
rolled in the study. All students in the classrooms of partici- 
pating teachers were invited to participate. All students who 
agreed to participate were enrolled. All third-, fourth-, and 
fifth-grade teachers in the district who did not participate as 
treatment classrooms were recruited to serve as control class- 
rooms, and those who agreed were enrolled. 

The data set contained 1,478 students at 22 schools and 63 
classrooms. One hundred two students in 14 classrooms in six 
schools were tutored with MyST, and 85 students in these same 
classrooms received human tutoring. Control students ac- 
counted for 1,155 students in 49 classrooms and 19 schools. 
These students received no tutoring, but did receive instruction 
in FOSS modules during class. For analysis, nonconsented 
students were removed from the sample. Other reasons for 
removing students from the sample included unmatched pre-post
tests, tests where students did not fill out a majority of answers,
and tests with grading concerns, including very low reliabilities.
The remaining sample totaled 1,167 students. Eighty-three stu- 
dents received MyST tutoring, 69 were tutored in small groups 
(both in 12 classrooms), and 1,015 students in 50 classrooms in
20 schools received only classroom instruction and no tutoring. 
All missing data were removed by an analyst who was blind to 
the experimental condition. 


Procedure 


Consented students in the study were assigned to receive tutor- 
ing in addition to their normal classroom instruction for the mod- 
ule. Teachers specified the space in the school to be used, and this 
varied from school to school, generally any relatively quiet room. 
The teacher also scheduled the time for their students to minimize 
the impact on the student’s other activities. Tutoring times were 
always during regular school hours. General guidelines were that 
this time should not be at recess or lunch, during core subject time 
(reading, math, science), or during special activities time (art, 
music). 

All students in the study received in-class instruction in the 
FOSS modules: Measurement (third grade), Magnetism and 
Electricity (fourth grade), Water (fourth grade), and Variables 
(fifth grade). Teachers in both treatment and control classrooms 
followed module lesson plans and used FOSS materials. Stu- 
dents participating in the study received tutoring from MyST or 
human tutors for 12-16 20-min sessions concurrent with their 
regular classroom instruction. Each tutorial was oriented around 
a set of key concepts the student was expected to have learned 
from classroom instructional activities. Both MyST and human 
tutoring used the same multimedia content linked to FOSS 
content. MyST students were tutored individually on comput- 
ers. Headsets with earphones and microphones were used to 
reduce noise interference. For most sessions, eight students at a 
time used the computers in a separate resource room at each 
school. Students in the human tutoring condition received tu- 
toring with human tutors for the same amount of time as those 
in the MyST group. They worked in groups of three to four 
students with each human tutor. Although one-on-one interac- 
tion with a human tutor would present a more direct comparison 
to the virtual tutor condition, the study did not have sufficient 
resources to provide one-on-one human tutoring; however, re- 
search has demonstrated equivalent learning gains for one-on- 
one and small-group tutoring (e.g., Bloom, 1984). 


Measures 


Students in all experimental groups were given the ASK sum- 
mative assessments as pre- and posttest measures. Tests were 
administered before the beginning of the FOSS lessons for the 
module, and immediately after tutoring for the module ended. 
The ASK assessments for the four modules used in the assessment 
have identical pre and post versions. Depending on the module, the 
assessments have between eight and 12 items, consisting of 
multiple-choice and constructed response questions, and show 
composite internal reliability with alphas in the range of 0.80 to
0.90. The interrater reliability for subjective items has also met
high standards in similar conditions (e.g., r = .90), and the validity 
of the measures has been built up over time through a process of 
empirical investigation. 

Because module tests have different scales, scores were stan- 
dardized to a common metric. All standardization was conducted 
on data with outliers and other spurious data removed. “Testwise”
standardization subtracted the mean of each test (over all students 
and pooling pre/post) from each student’s score. This difference 
was then divided by the average standard deviation for both pre 
and post for each test. 

Pairs of raters (tutors) scored all assessments from tutored 
students and a subset of assessments from control students. 
Raters trained together with scoring rubrics provided by FOSS, 
then scored the assessments independently. All scoring was 
blind to experimental condition (human tutor, virtual tutor, no 
tutoring) and whether the assessment was pre or post. Interrater 
reliabilities for two raters were high (counting only the open- 
ended items), with intraclass correlation coefficients ranging 
from .89 to .98, with averages for pre and post of .93 and .94, 
respectively. Internal reliabilities (Cronbach’s alpha) were 
lower, ranging from α = .60 to α = .89 for both pre and post
versions of the assessments, with averages for pre = .74 and 
post = .79. Scores used for outcome analysis were the averages 
across both raters. 


Results 


Several comparisons were made to test the hypotheses. Both
standardized pre/post gain scores and residual gain scores were
used; residual gain scores compare groups on the average differences
between their observed and expected scores. Gain differed
markedly depending on where students started on the pretest, 
regardless of which group they belonged to. Students who 
started lower on the pretest gained more than students starting 
higher. This is often a sign of regression toward the mean where 
greater gain occurs for students starting lower regardless of 
actual learning. Regression toward the mean complicated the 
group comparisons for this study because the control students 
on average scored much lower on the pretest than students 
receiving tutoring. We believe the lower pretest scores for the 
control were primarily due to two factors: 

1. Consented students (those whose parents returned signed 
permission forms) had higher pretest scores than nonconsented 
students. Pretest scores for nonconsented students were similar to 
the control group. 

2. Schools choosing to participate as treatment groups in the 
study were not representative of the overall free and reduced lunch 
(FRL) percentage of the district. Boulder Language Technologies 
worked with BVSD officials to identify a set of schools to recruit. 
All classroom teachers for the targeted grades in those schools 
were recruited, and all of the teachers who agreed to participate 
were enrolled. In this particular study, those teachers who agreed 
to participate represented schools that had smaller percentages of 
FRL students. Schools with higher percentages of FRL students 
tend to have lower test scores, and more of these schools were in 
the control group. 

When group comparisons were made, control students tended 
to gain more pre to post than tutored students simply because 
they started lower on the pretest. Residual gain scores and 
analysis of covariance (ANCOVA) were used for analysis to 
adjust for these differences in prescore (Rudestam & Newton, 
1999). The residual gain score is the observed posttest score minus
the expected posttest score; the expected score is the value
predicted from the pretest score by the regression line fitted to the
pre/post scatter. It is used to compare
groups and has a mean of zero, with a scale representing
standard deviation units.
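
A residual gain score of this kind can be computed as in the following Python sketch, which fits an ordinary least squares line of posttest on pretest and returns each student's residual. The function name and the toy data are hypothetical; the sketch assumes a single regression line fitted over all students, which the text implies but does not spell out.

```python
from statistics import mean


def residual_gains(pre, post):
    """Observed post score minus the post score predicted from pre."""
    mx, my = mean(pre), mean(post)
    sxx = sum((x - mx) ** 2 for x in pre)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(pre, post)]


# Toy standardized scores (illustrative values only)
pre = [-1.0, 0.0, 1.0, 2.0]
post = [0.5, 0.5, 2.0, 2.0]
res = residual_gains(pre, post)

# Residuals from a least squares fit with an intercept average to zero.
print(abs(sum(res)) < 1e-9)
```

A positive residual means a student scored higher on the posttest than the pretest score alone would predict, so group means of the residuals can be compared without the advantage that low pretest scorers have in raw gain.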


Comparison Between Tutored Groups 


The first hypothesis examined whether MyST and human- 
tutored groups were roughly equal to each other in pre/post gain. 
Students were randomly assigned within classrooms to tutoring 
conditions. Standardized gain for the human-tutored group (M = 
1.95, SD = 0.85) was not significantly different than for the 
MyST-tutored group (M = 1.75, SD = 1.03), t(150) = -1.31, p =
.190, d = .18. Residual gain for the human-tutored group (M =
0.51, SD = 0.66) was also not significantly different than for the
MyST-tutored group (M = 0.38, SD = 0.76), t(150) = -1.15, p =
.250, d = .15. Power analysis showed that for an effect size of d = 
.15, sample sizes of 600 students per group would be needed to 
reach significance at the .05 level with 80% power. The small 
effect size and lack of statistical significance support the first 
hypothesis that benefits of tutoring are roughly equal for human 
tutors and Marni in pre/post gain. 


Comparison With Control Group 


As stated, comparisons with the students in control classrooms 
were complicated by differences in pre-test scores. To adjust for 
these differences, comparisons were made with residual gain 
scores and an ANCOVA to test the second hypothesis that students 
in tutored groups gained more than students in the control group. 
Standardized gain scores showed a moderate difference between 
MyST (M = 1.75, SD = 1.03) and control (M = 1.57, SD = 1.01; 
d = .18) and a larger difference between the human (M = 1.95, 
SD = 0.86) and control (d = .40). Effect sizes for residual gain 
scores were calculated by the difference in means between groups 
divided by the pooled standard deviation for the residual gain 
distribution. A moderate effect size was observed for the compar- 
ison of MyST tutoring (M = .38, SD = .76) and control (M = 
-.06, SD = .84; d = 0.53) and a larger effect size for human
tutoring (M = .51, SD = .66) and control (d = 0.68). A one-way
analysis of variance (ANOVA) tested whether group means dif- 
fered significantly on residual gain score. The main effect for 
tutoring was significant, F(2, 1164) = 26.06, p < .001. Post hoc 
tests showed significant differences between both tutoring groups 
and the control group, and no significant differences between the 
two tutoring groups. 
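
As a worked check on the residual gain effect size for MyST versus control, the calculation can be sketched in Python. The means and standard deviations are those reported above, and the group sizes (83 MyST students, 1,015 control students) come from the Participants section. The pooling formula is the standard weighted one and is an assumption here, since the text does not give the exact formula used.

```python
from math import sqrt


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled_var)


# MyST residual gain versus control residual gain
d_myst = cohens_d(0.38, 0.76, 83, -0.06, 0.84, 1015)
print(round(d_myst, 2))  # 0.53
```

Under this assumption the result agrees with the reported d = 0.53. Because the control group is much larger, the pooled standard deviation is dominated by the control group's spread.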

An ANCOVA confirmed the findings from the analysis of 
residual gains. Like residual gain scores, ANCOVA also adjusts 
group means for differences in pretest. ANCOVA in this context 
gave almost identical results to the ANOVA using residual gains, 
F(2, 1163) = 26.60, p < .001. Comparisons of adjusted means 
were also nearly identical to effect sizes in residual gains for 
groups. ANOVA and ANCOVA tests support the second hypoth- 
esis that tutored groups gain significantly more from pre to post 
than students in the control group. 

Gain was also assessed as a function of prescore. Group com- 
parisons divided the prescore distribution for the tutored group into 
five equal parts. All groups showed higher gain for the lower 
prescore blocks. 

The use of hierarchical models allows for partitioning of error 
between students and classrooms, and quantifying how much total 
variability is due to each level. Estimates of classroom variability,
calculated with all students in the classroom, equaled 46%. Hy- 
pothesis testing for classroom effects showed significant effects 
for both MyST compared with control, (60) = 2.5, p = .014, and 
human compared with control, (60) = 3.0, p = .004. These results 
from hierarchical models also support the second hypothesis that 
tutored groups gain more from pre to post than the control group. 


Component Evaluation 


In order to evaluate the performance of the speech-processing 
components, student utterances for a subset of the assessment data 
were manually transcribed and parsed into frames to give the 
reference data to compare against. ASR performance is typically 
expressed as a word error rate (WER), which is the sum of word 
deletion, insertion, and substitution errors divided by the number 
of words in the reference string (from human transcriptions). The 
speech recognizer vocabulary size was 6,235 words. The WER for
the assessment sessions was 41.4%.¹ This is a large WER and
would not be viable for many applications. The system performed
well even with the high WER because the accuracy of extraction
of frame elements (the key concepts being discussed) from
students’ speech remained relatively high, with an overall Recall =
79% and Precision = 82%. So 79% of the relevant information in
the reference parses was correctly extracted from the ASR output.
Of the information extracted, 82% of the elements were correct.
These results indicate that many of the recognition errors were in
information that was irrelevant or redundant. Given the nature of
QtA dialogs and the way spoken responses are used by the system, 
this level of extraction accuracy was sufficient to produce both 
engaging and effective dialogs, as indicated by students’ responses 
to questionnaires and the learning gains. 
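
The two measures used above can be sketched in a few lines. WER is computed as the word-level edit distance (deletions, insertions, substitutions) divided by the number of reference words; precision and recall are computed over sets of extracted frame elements. The utterances and the `(slot, value)` frame elements below are invented for illustration and do not come from the MyST data.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumes the reference transcription is not empty.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def precision_recall(reference_elements, extracted_elements):
    """Precision and recall of extracted frame elements against a reference parse."""
    ref, ext = set(reference_elements), set(extracted_elements)
    correct = len(ref & ext)
    return correct / len(ext), correct / len(ref)


# Hypothetical utterance and recognizer output.
wer = word_error_rate(
    "the battery pushes electricity through the wire",
    "the battery push electricity through wire",
)

# Hypothetical frame elements as (slot, value) pairs.
reference = [("Source", "battery"), ("Flow", "electricity"),
             ("Path", "wire"), ("Device", "motor")]
extracted = [("Source", "battery"), ("Flow", "electricity"), ("Device", "bulb")]
precision, recall = precision_recall(reference, extracted)
print(round(wer, 3), round(precision, 3), round(recall, 3))
```

The toy case shows how word errors and extraction errors can diverge: misrecognized function words raise WER without removing any frame element.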


Survey Results 


A written survey was given to the students who participated in
the 2010-2011 assessment. Several measures were taken to reduce
the bias that arises when students give overly positive answers to
questionnaires: (a) surveys were written rather than oral,
(b) students were verbally assured of anonymity, (c) questionnaires
were anonymous in that students did not write their names on the
survey, and (d) adults from the program did not directly observe or
interfere with students while they
completed the survey. The survey included questions that asked for
ratings of student experience and impressions of the program and 
its usability. Three-point rating scales for survey items were keyed 
to each question. A typical question, such as How much did Marni 
help with science? had responses such as: Did not help, helped 
some, helped a lot. Items were written to reflect the reading level 
of the students. In general, students had positive experiences and 
impressions about the program. Across schools, 47% of students
said they would like to talk with Marni after every science
investigation, 62% said they enjoyed working with Marni “a lot,” and
53% selected “I am more excited about science” after using the
program. Only 4% felt that the tutoring did not help.

1 The performance of the ASR system was enhanced significantly over
the course of the project, and WER on the assessment data is now 21%.
However, the system and models were fixed at the start of the assessment
to avoid confounding the evaluation results with improvements in the
performance of the speech recognition system.

Teachers were asked for feedback to help assess the feasibility 
of an intervention using the system and their perceptions of the 
impact of the system. A teacher survey was given to all partici- 
pating teachers directly after their students completed tutoring. 
Teachers were assured anonymity in their responses both verbally 
and in written form. The questionnaire contained 22 rating items as 
well as nine open-ended questions. The survey asked teachers 
about the perceived impact of using Marni for student learning and 
engagement, impacts on instruction and scheduling, willingness to 
potentially adopt Marni as part of classroom instruction, and 
overall favorability toward participating in the research project. 
Additionally, teachers answered items related to potential barriers 
in implementing new technology in the classroom. Of the respond- 
ing teachers, 100% said that they felt it had a positive impact on 
their students, they would be interested in the program if it were 
available, and they would recommend it to other teachers. In 
addition, 93% said that they would like to participate in the project 
again. Furthermore, 74% indicated that they would like to have all 
of their students use the system (not just struggling students). They 
commented that students who used the system were more enthused 
about and engaged in classroom activities and that their participa- 
tion in science investigations and classroom discussions benefitted 
students who did not use the system. 


Conclusion 


In the present article, we presented the motivation, design, and 
evaluation results for a conversational multimedia virtual tutor for 
elementary school science. The operating principles for the tutor 
are grounded in research from education and cognitive science. 
Speech, language, and character animation technologies play a 
central role because the focus of the system is on engagement and 
spoken explanations by students during spoken dialogs with a 
virtual tutor. 

An assessment was conducted in schools to compare learning 
gains from human tutoring and MyST with business-as-usual 
classrooms. Both tutoring conditions had significantly higher 
learning gains than the control group. Although the effect size for
human tutors versus control (d = 0.68) was larger than for MyST
versus control (d = 0.53), post hoc tests showed no significant
difference between the two tutoring conditions.

After the assessment, surveys were collected from students and 
teachers that bear on the engagement and feasibility of the tutoring 
system. Following a series of tutoring sessions with Marni, the 
great majority of students reported that they enjoyed spending time 
working with her, that they felt that Marni helped them learn 
science, and that they felt more interested in science and more 
motivated to learn science than they had before using the system. 
Teachers reported that they would like to use MyST in the future 
to tutor all of their students and that they would recommend the 
program to other teachers. 

One conclusion that we draw from this study is that current
spoken dialog and character animation technologies can be
combined with media to provide engaging and effective experiences
for third-, fourth-, and fifth-grade students learning science.
Students who used MyST interacted with Marni for 4 to 5 hr across
the 16 dialog sessions, spread over an 8- to 10-week period. No
students dropped out of the study, and the large majority of
students reported positive experiences. We believe that the QtA
approach helped assure students that Marni is listening to and
understands what they are saying; this experience is fostered by
dialog moves such as revoicing and marking that Marni produces.
Dialogs based on QtA enable the tutorial dialog to proceed in a 
graceful way even when the system does not accurately interpret 
what the student said, because the system typically proceeds with 
a reasonable follow-up question, which the student accepts as a 
natural extension of the dialog. 

The system described presents baseline results for one specific 
system based on a number of design decisions. Further work is 
needed to understand the effects of the individual features of the 
system. For example, we do not know the relative contribution of 
media in helping students visualize science and construct expla- 
nations, or the contribution of the dialog moves and questions that 
Marni generated, to the learning gains that occurred. We believe 
the MyST system provides a framework and infrastructure for 
conducting research on these questions. Planned future work will 
allow us to expand the context of the interaction from one-on-one 
tutoring to systems that support conversations in which a virtual 
tutor is able to mediate conversations among small groups of 
students. The virtual tutor will then be able to ask questions that
help students build on each other’s ideas to co-construct
explanations consistent with accurate mental models of the science.


A Tutoring System That Simulates the Highly Interactive Nature of Human Tutoring


Sandra Katz and Patricia L. Albacete 
University of Pittsburgh 


For some time, it has been clear that students who are tutored generally learn more than students who 
experience classroom instruction (e.g., Bloom, 1984). Much research has been devoted to identifying 
features of tutorial dialogue that can explain its effectiveness, so that these features can be simulated in 
natural-language tutoring systems. One hypothesis is that the highly interactive nature of tutoring itself 
promotes learning—that is, the interaction hypothesis. Although reasonable and agreeing with much 
research, the interaction hypothesis raises the question of what linguistic mechanisms are involved: that 
is, which features of “highly interactive” dialogues trigger what processes that are conducive to learning? 
Our overall strategy in the research described in this article was to inform this question by identifying 
co-constructed discourse relations in tutorial dialogues whose frequency of occurrence predicts learning, 
identify the context in which these relations occur, and use this knowledge to formulate decision rules 
to guide automated dialogues. We used Rhetorical Structure Theory to identify and tag co-constructed 
discourse relations in a large corpus of physics tutoring dialogues. Our analyses suggest that the 
effectiveness of human tutoring might well lie in the language of tutoring itself. Moreover, the types of 
co-constructed discourse relations that predict learning seem to vary based on students’ ability level. We 
describe Rimac, a natural-language tutoring system that implements an initial set of decision rules based
on these analyses. These rules guide reflective dialogues about the concepts associated with physics
problems. Rimac is being pilot tested in high school physics classes.


Keywords: instructional dialogue, natural-language tutoring systems, Rhetorical Structure Theory 


Educators and policy makers in the United States have looked to 
educational technology as a tool to increase students’ proficiency 
in math, science, reading, and other subject matter domains. For 
example, early in his administration, President Obama (2009) 
challenged developers of intelligent tutoring systems (ITSs) to 
develop “learning software as effective as a personal tutor” (para. 
19). Apparently, Obama cast this challenge a bit too late. A recent
meta-analysis of research comparing the effectiveness of human
tutors with state-of-the-art ITSs showed that ITSs have already
nearly caught up with human tutors (VanLehn, 2011), with effect
sizes (d) of 0.79 for human tutoring and 0.76 for ITSs relative to
no tutoring (e.g., problem solving and reading, without feedback).
This comparison raises the bar for developers of ITSs. The
challenge now is to develop automated tutors that can perform even
better than human tutors with learners of all types.

Several researchers have proposed that the large effect sizes of 
human tutoring can be attributed to its highly interactive nature— 
that is, the high degree to which the student and tutor respond to 
and build upon each other’s dialogue moves (e.g., M. T. H. Chi, 
Siler, Jeong, Yamauchi, & Hausmann, 2001;² Graesser, Person, &
Magliano, 1995; van de Sande & Greeno, 2010). However, an 
important line of research conducted in the past few years to test 
this so-called interaction hypothesis showed that it is neither how 
much interaction takes place during tutoring that is important, nor 
the granularity of interaction—for example, whether the student 
and tutor discuss a step toward solving a problem or the substeps 
that lead to that step. Instead, what matters most is how well the 
interaction is carried out—for example, what content is addressed 
and how it is addressed in a particular dialogue context (e.g., M. 
Chi, VanLehn, Litman, & Jordan, 2010, 2011a, 2011b; Murray & 
VanLehn, 2006). 

This important finding suggests that the key to building tutoring
systems that surpass the effectiveness of human tutors is to specify
what we mean by effective interaction and to formulate “policies
for selecting the tutorial action at each microstep when there are
multiple action options available” (M. Chi et al., 2011a, p. 87).
Such “policies” have alternatively been called pedagogical
tutoring tactics or pedagogical decision rules. We use the latter term
here (decision rules, for short). As several developers of
natural-language (NL) tutoring systems have argued, because tutorial
dialogue is a form of discourse, defining effective interaction entails
identifying the particular linguistic mechanisms that support
learning during tutorial interaction (e.g., Boyer et al., 2010; Di Eugenio
& Green, 2010; Pilkington, 2001; Ravenscroft & Pilkington,
2000). Decision rules can then be specified to guide the tutor in
determining when and how to carry out these linguistic mechanisms.

1 VanLehn’s (2011) review showed that the two sigma effect for human
tutoring reported by Bloom (1984) testifies to the importance of a mastery
learning standard and is not typical of human tutoring in general.

2 Two important players in the field of tutoring research have the same
last name and first initial. In our citations, we use M. T. H. Chi to refer to
Michelene (“Micki”) T. H. Chi and M. Chi to refer to Min Chi.

This article describes the development of Rimac, a natural- 
language tutoring system that scaffolds students in acquiring a 
deeper understanding of the physics concepts and principles asso- 
ciated with quantitative physics problems. Rimac was designed to 
supplement instruction in physics tutoring systems such as Andes 
(e.g., VanLehn et al., 2005).³ Rimac is primarily engineered to
implement decision rules that guide the automated tutor in carrying 
out two linguistic mechanisms that have been found to predict 
learning from human tutoring: tutors’ abstraction and specification 
of students’ dialogue contributions (e.g., Katz, Allbritton, & Con- 
nelly, 2003; Ward, Connelly, Katz, Litman, & Wilson, 2009). This 
finding is supported by a significant body of prior research that 
demonstrates that the formation of abstract schema (i.e., mental 
representations of learned material) promotes transfer (e.g., Gick 
& Holyoak, 1983, 1987; Lehrer & Littlefield, 1993; Reed, 1993;
Salomon & Perkins, 1989). During tutoring, abstraction takes 
place when the tutor or student relates what his or her dialogue 
partner said to explain a more general concept or principle. For 
example, during physics tutoring, abstraction involves mapping 
the physical state presented in a problem to concepts and principles 
that explain that state or to a general script for solving that type of 
problem. Specification is the reverse and typically occurs when the 
tutor (or student) distinguishes between related concepts, instanti- 
ates a formula that represents a physics principle, applies a 
problem-solving script to the problem at hand, and so forth. 

From a linguistic perspective, abstraction and specification are 
often implemented through hypernym/hyponym pairs of terms 
(Halliday & Hasan, 1976). For example, in the following exchange 
from a live tutoring session, the tutor specifies “velocity” (hyper- 
nym) in the student’s turn to “horizontal components of the ve- 
locity” (hyponym). 


Example 1 


Student: Velocity is in the same direction as acceleration so the ball is 
faster coming down. 


Tutor: It [the ball] slows down going up, and it speeds up coming 
down— but all the time the horizontal components of the velocity stay 
unchanged. [italics ours] 


However, sometimes abstraction and specification are
implemented through semantic relations between speaker turns, with
few or no lexical cues such as those shown in Example 1, and
inference is required to detect these semantic relations. For
example, in the following exchange, the student needs to infer that the
tutor’s phrase “change in velocity” abstracts over the student’s
phrase “final velocity is larger than the starting velocity.”


Example 2 
Tutor: How do we know that we have an acceleration in this problem? 


Student: Because the final velocity is larger than the starting velocity, 
0. 


Tutor: Right—a change in velocity implies acceleration. [italics ours] 


In addition to implementing decision rules to guide the auto- 
mated tutor in abstracting and specifying students’ dialogue turns, 
Rimac also simulates a few other linguistic processes that com- 
monly occur during physics tutoring—most notably, joint con- 
struction of conditional reasoning relations, as we will illustrate 
presently (Louwerse, Crossley, & Jeuniaux, 2008). 

Several studies have shown that data-driven machine learning 
techniques such as reinforcement learning can be applied to logged 
interactions from natural-language tutoring systems in order to 
derive decision rules to guide tutorial interaction (e.g., Beck, 
Woolf, & Beal, 2000; M. Chi et al., 2010, 2011a, 2011b; Murray 
& VanLehn, 2006). Evaluations of ITSs that implement these rules 
have found that these systems significantly outperform counterpart 
systems that carry out random policies—for example, “eliciting” a 
problem-solving step or dialogue goal from the student sometimes, 
“telling” the student that step or goal at other times, without clear 
guidelines about what to do when. Some rule-driven tutoring 
systems have also outperformed systems that implement “fixed” 
tutoring policies (e.g., Murray & VanLehn, 2006)—for example, 
responding to students’ help requests with increasingly directive 
feedback such as prompt first, then hint, then teach relevant 
background knowledge, and then (if all else fails) tell the student 
what to do (the so-called bottom out hint). 
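
To make the contrast between the two baseline policies concrete, here is a minimal sketch. The action names follow the text (prompt, hint, teach, tell; elicit versus tell), but the function names and the way help requests are counted are assumptions made for illustration, not a description of any of the cited systems.

```python
import random

# "Fixed" policy: increasingly directive feedback, ending in the bottom-out hint.
FIXED_HELP_SEQUENCE = ["prompt", "hint", "teach", "tell"]


def fixed_policy(help_requests_so_far):
    """Return the feedback type for the next help request on the current step."""
    index = min(help_requests_so_far, len(FIXED_HELP_SEQUENCE) - 1)
    return FIXED_HELP_SEQUENCE[index]


def random_policy(rng):
    """Random policy: elicit or tell, with no guideline about which to use when."""
    return rng.choice(["elicit", "tell"])


escalation = [fixed_policy(n) for n in range(6)]
print(escalation)
print(random_policy(random.Random(0)))
```

A learned decision rule differs from both baselines in that it conditions the choice of action on features of the dialogue context.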

Although this research demonstrates the promise of automated 
methods for deriving effective decision rules to guide tutorial 
dialogue, it also shows that the process is both difficult and costly. 
As M. Chi et al. stated, “Finding effective tutorial tactics is not 
easy” (M. Chi, Jordan, VanLehn, & Litman, 2009, p. 197). In 
addition, the decision rules that stem from this approach are highly 
domain specific and difficult to interpret. Take, for example, one
decision rule that M. Chi et al.’s (2011a) reinforcement-learning-based
system defined for “elicit versus tell,” that is, whether a tutor
should prompt the student for domain content at a particular point
in a dialogue or tell the student that content:


Rule 6 suggests that when the next dialogue content step is difficult 
(StepSimplicityPS is 0), the ratio of physics concepts to words in the 
tutor’s turns so far is high (TuConceptsToWordsPS is 1), and the tutor 
has not been very wordy during the current session (TuAvg- 
WordsSesPS is 0), then the tutor should tell. (p. 96) 
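
Written as code, the quoted rule is a conjunction of three binary feature tests. The feature names and their 0/1 values are taken from the quotation; the dictionary representation, the function name, and the choice to return `None` when the rule does not fire are assumptions for this sketch, since the quotation does not say what happens otherwise.

```python
from typing import Optional


def rule_6(features: dict) -> Optional[str]:
    """Return "tell" when the three conditions of the quoted rule hold."""
    if (
        features["StepSimplicityPS"] == 0          # next content step is difficult
        and features["TuConceptsToWordsPS"] == 1   # high concept-to-word ratio so far
        and features["TuAvgWordsSesPS"] == 0       # tutor has not been very wordy
    ):
        return "tell"
    return None  # rule does not fire; another rule would have to decide


fires = rule_6({"StepSimplicityPS": 0, "TuConceptsToWordsPS": 1, "TuAvgWordsSesPS": 0})
does_not_fire = rule_6({"StepSimplicityPS": 1, "TuConceptsToWordsPS": 1, "TuAvgWordsSesPS": 0})
print(fires, does_not_fire)
```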


On the one hand, finely nuanced rules such as this one have a
benefit: researchers using conventional experimental methods
to test hypothesized decision rules could not have predicted these
rules in the first place. Similar observations have been made of the
use of automated approaches to identify linguistic features of tutorial
dialogue that predict learning, such as hidden Markov models
(e.g., Boyer et al., 2010). On the other hand, rules such as this are
cryptic and complex to implement, as the researchers have
acknowledged.

3 Rimac is the name of a river whose source is in the Andes. Its name is
a Quechua word meaning talking; hence, the nickname for Rimac, talking
river. We thus considered the name Rimac to be well suited for a dialogue
system that could be embedded within the Andes tutoring system.

In developing natural-language dialogues for Rimac, we strove to 
specify decision rules that were supported by preliminary empirical 
research, more intuitive than those illustrated previously, and readily 
implementable using a common framework for generating NL dia- 
logues, which we will describe presently. Consequently, we took a 
more conventional approach. We first performed correlational analy- 
ses to identify specific relations between tutors’ and students’ dia- 
logue moves in a large corpus of human-tutored physics dialogues 
that predict student learning gains from pretest to posttest. We then 
examined the context in which these relations typically occur and 
formulated decision rules that specify these contextual conditions. We 
implemented these rules within Rimac and are currently evaluating 
the system to determine if it outperforms a less interactive, less 
rule-driven tutoring system control. 

In the next section, we situate Rimac in a framework of tutoring 
research that highlights the need for effective decision rules to 
guide natural-language dialogue systems. In keeping with the 
theme of this special issue of the Journal of Educational Psychol- 
ogy, we then describe the empirical research that we conducted to 
derive decision rules to guide abstraction, specification, and other 
commonly occurring relations between students’ and tutors’ dia- 
logue turns, particularly during physics tutoring, and illustrate how 
we implemented these rules within Rimac. 


Cooperative Execution During Scaffolding 


The most intensive interaction during human one-on-one tutor- 
ing takes place during scaffolding, which M. T. H. Chi et al. (2001) 
defined as follows: 


[A] scaffolding move is a kind of guided prompting that pushes the 
student a little further along the same line of thinking, rather than telling 
the student some new information, giving direct feedback on a student’s 
response, or raising a new question or a new issue that is unrelated to the
student’s reasoning .... The important point to note is that scaffolding 
involves cooperative execution or coordination by the tutor and the 
student (or the adult and child) in a way that allows the student to take an 
increasingly larger burden in performing the skill. (p. 490). 


The nexus of scaffolding lies in the fourth step of Graesser et 
al.’s (1995) “five-step dialogue frame” (p. 504) to describe the 
cyclic nature of tutorial interaction: 


Step 1. Tutor asks question. 

Step 2. Student answers question. 

Step 3. Tutor gives short feedback on the quality of the 
answer. 

Step 4. Tutor and student collaboratively improve the qual- 
ity of the answer. 

Step 5. Tutor assesses student’s understanding of the answer.

As Graesser et al. (1995) and others (e.g., VanLehn et al., 2007)
have noted, understanding Step 4 of this frame (that is, scaffolding
to improve the student’s answer) could hold the key to
understanding why human tutoring is so effective.

M. T. H. Chi et al.’s (2001) definition of scaffolding names two 
linguistic mechanisms that drive it: coordination and cooperative 
execution. We consider coordination first, because more research
has been devoted to describing it. Coordination refers to the ways
in which the tutor and student “stay on the same page,” that is,
“grounding” the conversation by acknowledging their dialogue
partner’s moves, negotiating the meaning of terms, and sharing
knowledge (Clark & Schaefer, 1989; VanLehn, 2011). Coordination
can also be supported by various forms of verbal alignment,
such as lexical cohesion (e.g., word repetition, synonymy,
paraphrase), and syntactic (word order) alignment (Garrod &
Pickering, 2004). When the student hears his or her words (or word
order) echoed in the tutor’s turn, the student knows that the tutor
understood what he or she said. Several studies have shown that the
degree of lexical and syntactic cohesion (alignment) during tutor- 
ing predicts learning (e.g., Litman & Forbes-Riley, 2006; Stein- 
hauser et al., 2011; Ward & Litman, 2008, 2011), in addition to 
potentially enhancing coordination. 

Cooperative execution refers to the joint construction of a line of 
reasoning. According to VanLehn (2011), cooperative execution takes 
place as tutors prompt students to continue a line of reasoning, 
indicate who should continue the execution, and accept the student’s 
reasoning (p. 211). Our observations of tutorial dialogues reveal that
cooperative execution during scaffolding involves more than these
dialogue management processes; it also involves co-construction of
the parts of an emerging line of reasoning or explanation. The
analyses described in the Method section were motivated by our
hypothesis that tutoring researchers need to formally describe these
co-constructed dialogue moves and determine which types of moves
support learning in order to develop natural-language dialogue
systems that are as effective as, or even more effective than, human tutors.


A Linguistic Framework to Describe Cooperative 
Execution 


Rhetorical Structure Theory (RST) is a theoretical linguistic frame- 
work that specifies types of logical and functional relationships be- 
tween parts of text and spoken discourse, including various types of 
abstraction and specification relations. Mann and Thompson (1988), 
who developed RST, argued that “it describes the relations among 
text parts in functional terms, identifying both the transition point of 
a relation and the extent of the items related” (p. 271). Functional and 
logical relationships between parts of spoken and written discourse go 
by many names, including rhetorical relations, coherence relations, 
and discourse relations (Hovy, 1990). We use the last term here.

Table 1 defines and illustrates the set of abstraction/specification
relations, and other discourse relations, which we manually tagged in
a corpus of human tutorial dialogues in order to determine which
co-constructed relations predict learning and are thereby most
important to simulate in Rimac. For example, a student applies the equation
for acceleration; the tutor then says something general about
acceleration (e.g., “Acceleration is a vector and hence has direction as well
as magnitude.”). In RST, this is a jointly constructed instance:abstract
discourse relation. To take another example, the tutor describes a set
of conditions that apply to a given physical situation (for example,
“A car is moving to the right and is suddenly stopped”) and then
prompts the student to state the situation that follows from this set of
conditions (for example, that the car’s acceleration is to the left). This
is a co-constructed condition:situation (conditional) relation. Any
relation can be delivered didactically, by the tutor or student, instead
of interactively, as in these examples. For example, the tutor could
have stated the same conditional relation didactically as follows:
“Since the car is moving to the right and is suddenly stopped, its
acceleration is to the left.” However, we focused our investigation on
the potential relationship between co-constructed discourse relations
and learning because these relations realize cooperative execution
during scaffolding.


Table 1
Discourse Relations Tagged in the Dialogue Corpus

Each entry gives the relation and its definition (S = speaker), followed by an example.


Abstraction/specification relations

Abstract:instance (instance:abstract): S2 instantiates the abstraction stated
by S1, or S2 abstracts over the information presented by S1.

    Tutor: How can the acceleration be 0 if there are forces on it?
    Student: The sum of the forces equal 0 for there to be no acceleration.
    Tutor: That’s exactly right. The weight and the normal force are (in
    this case) equal and opposite.
    Explanation: “In this case” (as the tutor says), the weight and normal
    force being equal and opposite represent an instance of the
    abstraction “sum of forces equal 0.”

Set:member (member:set): S2 presents a member of the set referred to
by S1, or S2 names the set to which an item mentioned by S1
belongs.

    Tutor: What does the problem ask for?
    Student: The magnitude of the acceleration
    Tutor: What type of acceleration?
    Student: Average
    Explanation: The tutor refers to acceleration as a set and prompts for
    a member of that set; the student gives the type of acceleration
    asked for in the problem.

Whole:part (part:whole): S2 names a part of an object that S1 referred to,
or S1 names a part of an object named by S2. (In physics, “parts” are
often vector components or the specific forces acting on an object.)

    Student: Acceleration would be plus.
    Tutor: Right, the x component of the acceleration would be plus.
    Explanation: The student names a vector (acceleration); the tutor
    refers to a specific component of that vector.

Process:step (step:process): S2 presents a step that follows from the
process or line of reasoning described by S1, or S2 describes the line
of reasoning that leads to the step described by S1.

    Student: The acceleration is 0.
    Tutor: So then m*a = 0 = F_net = T - W and hence T = W.
    Explanation: The student gives a step in a line of reasoning; the tutor
    expands the line of reasoning (process) that follows from that step.

Object:attribute (units, direction, magnitude): S1 names an object or
value; S2 specifies a property of that object, in particular its units,
direction, or magnitude.

    Student: Velocity is 14.
    Tutor: Right, 14 m/s.
    Explanation: The student provides a value for velocity; the tutor
    specifies its units.

Term:definition (definition:term): S2 defines a term mentioned by S1, or
S2 labels a statement by S1 with an appropriate term.

    Tutor: What is the definition of the average acceleration (in words or
    in mathematics)?
    Student: A = (Vf - Vo)/Tf - To.
    Explanation: The tutor prompts the student to define average
    acceleration; the student does so.

General:specific (specific:general): S2 names a state, object, or action
that is related to the content in S1 but is more specific, or S2 is more
general than the state, object, or action referred to in S1. Applies
when none of the preceding relations apply.

    Student: Average acceleration can vary.
    Tutor: Right; it can go up above the average and down below it.
    Explanation: The tutor specifies how acceleration can vary.


Other commonly occurring relations in physics tutoring

Condition:situation (situation:condition): (a) S1 presents a condition or
set of circumstances, and S2 states the situation that stems from or
coincides with those conditions, or (b) S1 presents a situation, and S2
states the conditions or circumstances that explain that situation.

    Tutor: When do kinematics equations apply?
    Student: When the acceleration is constant.
    Explanation: This relation could be stated in conditional form: if
    acceleration is constant, then the kinematics equations apply.

Compare: S2 compares an object, situation, or value referred to by S1
with some other object, situation, or value.

    Tutor: What is the net force that the air bag imparts to the driver?
    Student: Equal to the force the driver applies to the airbag.
    Tutor: Same direction?
    Student: No, opposite direction.
    Explanation: The tutor prompts the student to compare the value and
    direction of two forces.


Method 


To reiterate, our goals in the analyses described in this section were 
to (a) determine whether the frequency of particular types of co-constructed 
discourse relations (those described and illustrated in Table 1) predicts 
learning, and whether this varies by student ability level, and (b) 
formulate decision rules that specify the contexts in which the 
discourse relations that predict learning occur, so that these rules can 
guide student-tutor interaction in a NL tutoring system (Rimac). 
Toward these aims, we coded all instances of co-constructed dis- 
course relations in a large corpus of human-tutored physics dialogues. 
The dialogue corpus and our approach to coding identified relations 
are described in this section. 


Dialogue Corpus 


A well-known problem in physics education is that many stu- 
dents learn to apply scripts for solving particular types of problems 
and succeed in college-level physics courses; however, they none- 
theless leave these courses without understanding fundamental 
physics concepts and principles (Halloun & Hestenes, 1985). Re- 
flective discussions following problem-solving exercises encour- 
age students to think about the concepts and principles associated 
with quantitative problems, often by changing some aspect of the 
problem and prompting the student to consider how the answer 
would change, as illustrated in Table 2. Several studies have 
demonstrated the instructional benefits of reflection on problem- 
solving exercises (e.g., Collins & Brown, 1986; Katz, Connelly, & 
Wilson, 2007; Katz et al., 2003; Lee & Hutchison, 1998; Tch- 
etagni, Nkambou, & Bourdeau, 2007; Ward & Litman, 2011). 

The dialogue corpus that we analyzed stems from previous research 
in which we compared the effectiveness of human-guided reflective 
discussions about physics problems solved within the Andes physics 
tutoring system (VanLehn et al., 2005) with static text explanations 


Table 2 
Example of a Reflective Dialogue Between a Human Tutor 
and Student 


Problem: In the figure below, each of the three strings exerts a tension 
force on the ring as marked. Use the labels S1, S2, and S3 to refer to 
the three strings. Find the components of the net force acting on the 
ring. 


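[Figure: a ring with three strings (S1, S2, and S3) exerting tension 
forces on it; the legible force labels include 200 N and 400 N.] 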


Reflection question: What if I now told you that this ring has an 
acceleration. If you knew the mass of the ring (3 kg), how would you 
solve for the acceleration? 

Student: 73.2 = 3*a; 100 - Fw = 3a. Is this right; how would the 
acceleration be the same for both? 

Tutor: You have to keep the a_x and a_y distinguished. They are two 
completely independent numbers that (together with a_z) specify your 
acceleration vector. You don’t try to boil them down to one number. 
It’s as if I told you, “To get to my house, you go 3 blocks north and 
5 blocks east,” and you said, “Ah, so you just go 8 blocks.” The two 
numbers together are the vector; they don’t “boil down” to one 
number. OK? 

Student: But can’t it only have one acceleration? 

Tutor: It does have only one acceleration, but that acceleration is a 
vector and it takes 3 numbers to write it down. You need to review 
vectors in some detail; a_x, a_y, and a_z together specify the 
acceleration vector. 


Note. This example problem is part of the Andes Physics Tutor system, 
which was developed at Arizona State University and the University of 
Pittsburgh with support from the Pittsburgh Science of Learning Center, 
National Science Foundation Award SBE-0836012, and Office of Naval 
Research Grant N00014-96-1-0260 and is available at 
http://www.andestutor.org 


and a no-dialogue control. We summarize the data collection procedures 
that produced the dialogue corpus in this section. More details 
about the study can be found in Katz et al. (2003). 

Students who were taking an introductory physics course at 
the University of Pittsburgh first took a physics pretest, with nine 
quantitative and 27 qualitative physics problems. Following the pre- 
test, students reviewed a workbook chapter developed for the exper- 
iment and then received training on using Andes. There were three 
conditions: one in which students received reflection questions and 
interacted with a human tutor via a chat interface; a second reflection 
condition in which students were asked the same set of reflection 
questions but received a static text explanation as feedback after they 
responded to these questions; and a third, a control condition in which 
students were not asked reflection questions but solved more prob- 
lems than students in the other two conditions to control for time on 
task. There were 15 students in the static text and control conditions 
and 16 students in the human-tutored condition. In the correlational 
analyses discussed here, we only analyzed data from the human- 
tutored condition, since we were interested in modeling effective 
aspects of human tutorial dialogue. 

Students in each condition began by solving a problem in 
Andes. After completing the problem, students in both the static 
feedback and human-tutored conditions were presented with a 
conceptually oriented reflection question, as illustrated in Table 2. 
Reflection questions such as the one shown in Table 2 are not part 
of Andes; they were added for the experiment. After a student in 
the human-tutored condition entered a response to the reflection 
question, the student engaged in a typed dialogue with his or her 
tutor via a simple chat interface. This dialogue continued until the 
tutor was satisfied that the student understood the correct answer to 
the question. 

Between three and eight reflection questions were asked per prob- 
lem solved in Andes for a total of 12 problems. After completing these 
problems and their corresponding reflective dialogues, students took a 
posttest that was isomorphic to the pretest, and the test order was 
counterbalanced. The main finding of the study was that students who 
answered reflection questions learned more than students in the no- 
reflection control, who solved more Andes problems (Katz et al., 
2003). Consistent with authors of several other studies who found a 
null effect for the interaction hypothesis, we did not observe a signif- 
icant difference between the static feedback and human-tutored con- 
ditions (VanLehn, 2011; VanLehn et al., 2007). However, the human- 
tutored dialogue corpus revealed abundant instances of highly 
interactive, cooperative execution during scaffolding episodes—spe- 
cifically, exchanges in which the tutor incorporated parts of the 
student’s turn, built on the student’s turn, and so on (e.g., Table 
2)—or less frequently, the student did the same with respect to a 
preceding tutor turn. Hence, we deemed this corpus well-suited for 
exploring correlations between interactivity and student learning out- 
comes. 

The dialogue corpus is sizeable. Among the 16 students in the 
human-tutored condition (four men, 12 women), 15 completed all 
60 reflection question dialogues with a human tutor; one student 
participated in 53 dialogues, producing a total of 953 reflective 
dialogues. There were a total of 2,218 student turns and 2,135 tutor 
turns across dialogues. The average number of turns per reflective 
dialogue was 4.6, ranging from 2.1 turns for simple reflection 
questions to 11.4 turns for the most complex questions. All dialogue 
examples presented in this article stem from this tutoring 
corpus, unless otherwise noted. 
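
The corpus totals reported above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts given in this section; the variable names are ours and are not part of the original study materials.

```python
# Counts reported for the human-tutored condition.
students_complete = 15       # students who completed every dialogue
dialogues_per_student = 60   # reflection question dialogues per student
dialogues_partial = 53       # dialogues completed by the remaining student

total_dialogues = students_complete * dialogues_per_student + dialogues_partial

student_turns = 2218
tutor_turns = 2135
mean_turns = (student_turns + tutor_turns) / total_dialogues

print(total_dialogues)       # 953
print(round(mean_turns, 1))  # 4.6
```

Both results agree with the figures in the text: 953 reflective dialogues and an average of 4.6 turns per dialogue.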


Coding Scheme 


Within each reflective dialogue, all student and tutor turns were 
first manually parsed into clauses. We then searched for co- 
constructed discourse relations at the exchange level—that is, 
between a tutor’s dialogue turn and the subsequent student turn, or 
the reverse. We coded these relations at two levels of analysis: 
abstraction level type, and discourse relation type. 


Abstraction Level Type 


At the coarsest level, we tagged the level of abstraction of each 
exchange in which a discourse relation was co-constructed. Four 
codes distinguish these levels of abstraction, as described in the 
following. Code abbreviations are shown in parentheses. 

Specific-to-general (spec:gen). This code refers to abstrac- 
tion, which happens in two main ways. The first type is when the 
second speaker refers to a more general concept, principle, or 
value than one that the first speaker referenced in his dialogue turn. 
For example, in the following exchange, the tutor refers to speed, 
and the student classifies speed as a scalar quantity: 


Example 3 


Tutor: Since the question asked about SPEED, suppose we had found 
v_y to be negative. Should we include the minus sign when giving the 
speed? 


Student: I would say no because speed is scalar and doesn’t include 
direction. 


In the second type of abstraction, the second speaker refers to a 
physics principle that explains or is illustrated by problem-specific 
content in the first speaker’s turn. For example, in the following 
exchange, the tutor prompts the student to apply a principle about 
the relationship between acceleration and velocity to the bullet in 
the case at hand: 


Example 4 


Reflection question: The bullet is travelling to the right. What direc- 
tion is its acceleration? 


Student: To the left because it is making the bullet slow down. 


Tutor: Good—when something is slowing down, its acceleration has 
a component opposite to its velocity. 


General-to-specific (gen:spec). This code refers to specifica- 
tion, which is the inverse of abstraction and also happens in two 
main ways. The first type is when the second speaker refers to a 
more specific concept, principle, or value than the one to which the 
first speaker referred. For example, in the following exchange, the 
tutor asks for the forces on a climber, and the student names two 
types of forces: 


Example 5 
Tutor: What are the forces on her? 


Student: Her weight and the tension of the rope. [italics ours] 


The second main type of specification is when the second 
speaker instantiates a principle or concept to which the first 
speaker refers. For example, in the following exchange, the student 
carries out the tutor’s directive to apply Newton’s second law to 
the current problem: 


Example 6 


Tutor: Now use Newton’s second law and find [the climber’s] accel- 
eration—a number and units; show me the symbols (the algebra). 


Student: 39/55 = a, a = .71 m/s^2 downward. 


Specific (spec). This code refers to cases in which the student 
and tutor are both speaking at the same level of abstraction, typically 
in reference to a particular problem. For example, in the following 
exchange, the tutor and the student refer to the bungee in the current 
problem. The tutor explains the situation that would result from the 
student’s erroneous claim via a co-constructed conditional relation: 


Example 7 


Student: The only force acting on the bungee is the weight of the 
person. 


Tutor: If that were true, the bungee would accelerate downward! 


General (gen). This code refers to cases in which the student 
and tutor both speak at an abstract level, referring to principles, 
laws, definitions, and so forth that are not directly tied to a 
particular problem. For example, in the following exchange, the 
tutor and student step outside of the context of the current problem 
(about a falling hailstone) to discuss the difference between dis- 
tance and displacement, in this comparison relation: 


Example 8 
Tutor: Is there a difference between displacement and distance? 


Student: The displacement can have either value [+ or −], but 
distance is only +. 


Table 3 presents the mean and standard deviation of abstraction 
level tags across subjects. 
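
The four abstraction level codes can be read as a simple mapping from the level of the first speaker's turn and the level of the second speaker's turn. The sketch below is illustrative only: in the study the codes were assigned by human coders, and the function name `abstraction_code` and its string arguments are our assumptions.

```python
def abstraction_code(first_level: str, second_level: str) -> str:
    """Return the abstraction level code for one exchange.

    Each argument is "spec" (problem-specific) or "gen" (general), i.e. the
    level at which the first and the second speaker are talking.
    """
    valid = {"spec", "gen"}
    if first_level not in valid or second_level not in valid:
        raise ValueError("levels must be 'spec' or 'gen'")
    if first_level == second_level:
        return first_level                   # "spec" or "gen"
    return f"{first_level}:{second_level}"   # "spec:gen" or "gen:spec"


# Example 3: the tutor mentions speed (specific), the student classifies
# it as a scalar quantity (general).
print(abstraction_code("spec", "gen"))
```

Under these assumptions, Example 3 maps to spec:gen, Example 5 to gen:spec, Example 7 to spec, and Example 8 to gen.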


Discourse Relation Type 


At a finer level of analysis, we tagged the dialogue corpus for 
the particular types of abstraction and specification relations de- 
fined and illustrated in Table 1, in addition to two other commonly 
occurring discourse relations in physics tutoring dialogues: conditional 
reasoning statements and comparisons. Most of these discourse relations 
are bidirectional (e.g., set:member, member:set); the exceptions are 
object:attribute and compare. We tagged bidirectional relations 
separately (e.g., we treated set:member and member:set as individual 
relations) and also treated each of the object:attribute categories as a 
separate relation. Hence, overall, there are 17 discourse relations in 
our coding scheme. 


Table 3 
Mean Frequency of Abstraction Level Tags Across Tutored 
Subjects (N = 16) 


Abstraction level        Mean           SD 
Specific-to-general      14.13          4.83 
General-to-specific      [illegible]    [illegible] 
Specific                  3.31          2.18 
General                  11.06          5.31 


The basic unit of analysis at the discourse relation level is one 
of these codes, specified in two ways. First, we specify the direction 
of the co-constructed relation in the exchange: does the tutor (T) 
start the relation and the student (S) complete it, or the reverse? 
The former is indicated by T-S before the discourse relation name, and 
the latter by S-T. For example, S-T set:member represents a set:member 
relation that the student initiates and the tutor completes, and T-S 
abstract:instance represents an abstract:instance relation that the 
tutor initiates and the student completes. To illustrate, in the 
example shown in Table 1 for set:member, the second exchange (T: What 
type of acceleration? S: Average) would be tagged as T-S set:member. 

The second way in which we modify discourse relation tags is by 
indicating whether the second turn in a tagged relation was prompted, 
via a question, or initiated by the second speaker. Prompted relations, 
such as the one for set:member, are unmodified; that is, T-S set:member 
means that the tutor prompted the student to provide a 
member of a named set, as in the preceding example about “type of 
acceleration.” Initiated relations are flagged as elaborations (elab), 
because the second speaker is adding information to what the first 
speaker said. To illustrate, in the example for abstract:instance shown 
in Table 1, the tutor elaborates on the student’s turn, by instantiating 
the student’s abstract statement: 


Example 9 
Student: The sum of the forces equals 0 for there to be no acceleration. 


Tutor: That’s exactly right. The weight and the normal force are (in 
this case) equal and opposite. 


This relation would be tagged as S-T elab(abstract:instance) to 
indicate that the tutor elaborated on the student’s statement via an 
abstract:instance relation. Instantiation is signaled by the tutor’s 
phrase “in this case.” 
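
The tag notation described above (direction, base relation, and an optional elab modifier) can be captured in a small record type. This is a minimal sketch under our own naming; `RelationTag` and its fields are hypothetical and do not come from the authors' coding tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationTag:
    initiator: str    # "T" if the tutor starts the relation, "S" if the student does
    relation: str     # base relation, e.g. "set:member"
    elaborated: bool  # True if the second turn was initiated rather than prompted

    def __post_init__(self):
        if self.initiator not in ("T", "S"):
            raise ValueError("initiator must be 'T' or 'S'")

    @property
    def completer(self) -> str:
        # The other dialogue partner completes the relation.
        return "S" if self.initiator == "T" else "T"

    def label(self) -> str:
        direction = f"{self.initiator}-{self.completer}"
        body = f"elab({self.relation})" if self.elaborated else self.relation
        return f"{direction} {body}"


# The "type of acceleration" exchange from Table 1 (prompted by the tutor).
print(RelationTag("T", "set:member", False).label())
# Example 9: the tutor elaborates on the student's abstract statement.
print(RelationTag("S", "abstract:instance", True).label())
```

The two calls reproduce the tags used in the text, T-S set:member and S-T elab(abstract:instance).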

In addition to prompted and initiated variants of discourse relations, 
in both directions (S-T and T-S), we included three types of aggregate 
variables in our analyses. One aggregate variable includes the 
four prompted and initiated (elaborated) forms of a discourse relation. 
For example, the aggregate variable whole:part represents: 


S-T whole:part + T-S whole:part + S-T elab(whole:part) + T-S 
elab(whole:part). 


The second type of aggregate variable includes the four forms of 
the first relation, plus the four forms of its inverse. For example, 
the following formula represents all-whole:part-bd, where bd 
means bidirectional, for a particular relation (e.g., whole:part and 
part:whole, each consisting of the four forms shown in the for- 
mula): 


[S-T whole:part + T-S whole:part + S-T elab(whole:part) + T-S 
elab(whole:part)] + [S-T part:whole + T-S part:whole + S-T 
elab(part:whole) + T-S elab(part:whole)]. 


The third type of aggregate variable includes the summation of all 
initiated elaborations. Specifically, T-S elab is the summation of 
student elaborations on the tutor’s previous turn, for all base 
relations (e.g., whole:part, set:member); S-T elab is the summation 
of tutor elaborations of the student’s previous turn, for all base 
relations; and all-elab-bd = T-S elab + S-T elab. 
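
The three kinds of aggregate variables can be computed directly from per-student tag counts. The sketch below assumes tags are stored as strings in the notation used in this section; the function names and the example counts are hypothetical and are not data from the study.

```python
from collections import Counter


def aggregate(counts: Counter, relation: str) -> int:
    """Sum the four forms of a relation: two directions, prompted and elaborated."""
    forms = [f"{direction} {body}"
             for direction in ("S-T", "T-S")
             for body in (relation, f"elab({relation})")]
    return sum(counts[form] for form in forms)


def aggregate_bd(counts: Counter, relation: str) -> int:
    """Bidirectional aggregate: the relation plus its inverse."""
    first, second = relation.split(":")
    return aggregate(counts, relation) + aggregate(counts, f"{second}:{first}")


def elab_total(counts: Counter, direction: str) -> int:
    """Sum all elaborations in one direction, over every base relation."""
    prefix = f"{direction} elab("
    return sum(n for tag, n in counts.items() if tag.startswith(prefix))


# Hypothetical tag counts for a single student.
counts = Counter({
    "S-T whole:part": 1,
    "T-S whole:part": 2,
    "S-T elab(whole:part)": 1,
    "T-S part:whole": 1,
    "S-T elab(set:member)": 2,
    "T-S elab(step:process)": 1,
})

whole_part = aggregate(counts, "whole:part")
whole_part_bd = aggregate_bd(counts, "whole:part")
all_elab_bd = elab_total(counts, "T-S") + elab_total(counts, "S-T")
print(whole_part, whole_part_bd, all_elab_bd)
```

A `Counter` is used so that forms that never occur for a student count as zero rather than raising an error.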

Table 4 summarizes the means and standard deviation of dis- 
course relation tags and aggregate tags across subjects. 


Data Analysis 


We conducted correlational analyses between the frequency of 
abstraction level codes, discourse relation codes, and three mea- 
sures of student learning: overall gain score from pretest to post- 
test, gain score on qualitative test items, and gain score on quan- 
titative test items. We conducted these analyses taking the 16 tutored 
students as a whole and separately for low and high pretest students, 
as classified according to a median split. There were seven high 
pretest students and nine low pretest students. These numbers are 
uneven because the two pretest scores in the middle of the distri- 
bution were identical; both students who had these scores were 
assigned to the low pretest group. We divided students into these 
ability groups in order to investigate whether better prepared 
students (high pretesters) might benefit from co-constructing different 
types of discourse relations with their tutor than less well-prepared 
students (low pretesters). 


Table 4 
Mean Frequency of Discourse Relation Tags Across Tutored 
Subjects (N = 16) 


Discourse relation variable or 
aggregate variable                 Mean     SD 
Abstract:instance                  9.63     5.28 
Instance:abstract                  3.50     2.34 
All-abstract:instance-bd          13.13     6.02 
All-compare                        3.19     1.83 
Term:definition                    3.00     2.25 
Definition:term                    0.13     0.34 
All-term:definition-bd             3.13     [illegible] 
Object:attribute-units             1.63     2.06 
Object:attribute-direction         4.06     2.41 
Object:attribute-sign              0.19     0.40 
Object:attribute-magnitude         0.69     0.79 
All-object-attribute               6.56     3.76 
Process:step                       0.56     0.63 
Step:process                       3.00     2.34 
All-process:step-bd                3.56     2.31 
Set:member                         0.88     1.50 
Member:set                         2.00     1.67 
All-member:set-bd                  2.88     2.58 
Whole:part                         2.88     1.78 
Part:whole                         0.44     0.73 
All-part:whole-bd                  3.31     1.99 
Circumstance:situation            13.00     6.79 
Situation:circumstance             9.19     2.90 
All-circumstance:situation-bd     22.19     6.93 
Gen:spec                           3.31     [illegible] 
Spec:gen                           0.75     1.07 
All-gen:spec-bd                    4.06     2.24 
T-S elab                           1.00     1.10 
S-T elab                          22.13    12.15 
All-elab-bd                       23.13    12.41 


Note. Aggregate variables include modified forms of the base relation 
(e.g., whole:part) as described in the text. Gen = general; Spec = specific; 
T = tutor; S = student; bd = bidirectional; T-S elab = summation of 
student elaborations on the tutor’s previous turn, for all base relations; S-T 
elab = summation of tutor elaborations of the student’s previous turn, for 
all base relations; all-elab-bd = T-S elab + S-T elab. 
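
The median split and the correlational analysis described in this section can be sketched as follows. This is an illustration under stated assumptions: the pretest scores are invented, the rule "scores at or below the median go to the low group" is our reading of how the tied middle scores were handled, and the helper names are ours.

```python
import math
from statistics import median


def median_split(pretest):
    """Split student ids into (low, high); scores at or below the median go low."""
    cut = median(pretest.values())
    low = [sid for sid, score in pretest.items() if score <= cut]
    high = [sid for sid, score in pretest.items() if score > cut]
    return low, high


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Invented pretest scores for 16 students; the two middle scores are tied.
scores = [10, 11, 12, 13, 14, 15, 16, 18, 18, 19, 20, 21, 22, 23, 24, 25]
pretest = {f"s{i}": score for i, score in enumerate(scores, start=1)}

low, high = median_split(pretest)
print(len(low), len(high))  # 9 7
```

With the tie in the middle of the distribution, the rule yields nine low pretest and seven high pretest students, the same group sizes reported above. `pearson_r` would then be applied within each group to a tag frequency and a gain score.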

The results of these analyses are presented in the next section. 
We then describe the decision rules that stem from these findings. 


Results and Discussion 


Discourse Relations That Predict Learning: All 
Students Considered Together 


Correlations for the subject pool taken as a whole (N = 16) are 
displayed in Table 5. To save space, we only discuss significant 
findings (p ≤ .05) for all three types of gain. 
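
As a rough check on how a correlation relates to the significance threshold with a sample this small, the correlation can be converted to a t statistic with n − 2 degrees of freedom. The sketch assumes a Pearson correlation tested two-tailed, which the text does not state explicitly; the critical value is the standard tabled value for 14 degrees of freedom.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """t statistic for a Pearson correlation r based on n observations."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# S-T elab(whole:part) and overall gain in Table 5: R = .524, N = 16.
t = t_from_r(0.524, 16)

T_CRIT_DF14 = 2.145  # two-tailed critical t at alpha = .05, df = 14
print(round(t, 2), t > T_CRIT_DF14)
```

The statistic comes out near 2.30, just above the critical value, which is consistent with the reported p of .037.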

Overall gain. The frequency of three discourse relations pre- 
dicted overall gain: (a) various forms of the whole:part relation 
[S-T elab(whole:part) and two aggregate variables: whole:part and 
all-whole:part-bd], (b) S-T situation:condition relations, in which 
the student prompts the tutor to specify the conditions under which 
a physical situation occurs and the tutor replies accordingly, and 


Table 5 
Correlations for All Students Considered Together (N = 16) 


Abstraction level and discourse 
relations                            Mean    SD       R              p 

Overall gain 
  Abstraction level: [spec:gen]      14.13   4.829    .450           .081 
  Discourse relations 
    S-T elab(step:process)            1.56   1.365    .646           .007** 
    step:process                      3.00   2.338    [illegible]    [illegible] 
    S-T elab(member:set)              0.94   0.680    .667           .005** 
    S-T elab(whole:part)              1.00   1.366    .524           .037 
    whole:part                        2.88   1.784    .528           .035 
    all-part:whole-bd                 3.31   1.991    [illegible]    .026 
    S-T situation:condition           0.44   0.814    [illegible]    .034 
    [definition:term]                 0.13   0.342    −.485          .057 
    [all-proc:step-bd]                3.56   2.308    .473           .064 

Qualitative gain 
  Abstraction level: spec:gen        14.13   4.829    .516           .041 
  Discourse relations 
    S-T elab(step:process)            1.56   1.365    .653           .006** 
    step:process                      3.00   2.338    .591           .016 
    all-proc:step-bd                  3.56   2.308    [illegible]    .036 
    S-T elab(member:set)              0.94   0.680    [illegible]    [illegible] 
    [T-S elab(term:definition)]       0.06   0.250    .469           .067 
    [definition:term]                 0.13   0.342    −.443          .086 
    [S-T step:process]                0.06   0.250    .469           .067 
    [whole:part]                      2.88   1.784    .463           .071 
    [all-part:whole-bd]               3.31   1.991    .457           .075 
    [S-T situation:condition]         0.44   0.814    .487           .056 

Quantitative gain 
  Abstraction level: spec             3.31   2.182    [illegible]    .035 
  Discourse relations 
    S-T elab(set:member)              0.06   0.250    .740           .001** 
    S-T elab(whole:part)              1.00   1.366    .675           .004** 
    [T-S instance:abstract]           0.38   0.500    −.493          .052 
    [Object:attribute-magnitude]      0.69   0.793    −.467          .068 
    [Process:step]                    0.56   0.629    −.445          .084 
    [S-T elab(member:set)]            0.94   0.680    .443           .086 
    [all-part:whole-bd]               3.31   1.991    .452           .079 


Note. Trends are indicated by brackets. Gen = general; Spec = specific; 
T = tutor; S = student; bd = bidirectional; elab = elaborated; proc = 
process. 
**p ≤ .01. 


(c) various forms of the step:process relation [S-T elab(step:process) 
and the aggregate variable step:process], in which one 
dialogue partner provides the steps in a line of reasoning that stem 
from, or lead to, a step in his partner’s turn, for example: 


Example 10 


Reflection question: How do we know that we have an acceleration in 
this problem? 


Student: Because of gravity pulling down. 


Tutor: The force due to gravity produces a net force and thus an 
acceleration. 


In this exchange, the tutor provides the line of reasoning that 
follows from the student’s response (gravity → existence of a net 
force → existence of acceleration), via an S-T elab(step:process) 
relation. 

Qualitative gain. Generalizations predicted learning of a 
qualitative (conceptual) nature; a trend was also found for 
generalization and overall gain. This is not surprising, given that 
generalizations typically address physics concepts, laws, and 
principles. As with overall gain, various forms of the step:process 
relation also predicted qualitative gain across subjects [S-T 
elab(step:process) and two aggregate variables: step:process and 
all-proc:step-bd]. In addition, a particular type of generalization 
predicted qualitative gain: S-T elab(member:set), in which the 
tutor elaborates on a student turn by stating the set to which an 
object that the student referred to belongs: 


Example 11 


Reflection question: How do we know that we have an acceleration in 
this problem? 


Student: Because it is a free fall problem so gravity is at work. 


Tutor: Gravity is a type of acceleration. 


Quantitative gain. The “spec” abstraction level type, repre- 
senting exchanges in which the tutor and student refer to the 
current problem, negatively correlated with quantitative gain. 
However, two particular forms of specification strongly predicted 
quantitative gain: S-T elab(set:member), in which the tutor states 
a member of a set that the student referred to, and S-T elab(whole:part), 
which typically reflects exchanges in which the tutor specifies the 
components of a vector that the student mentioned or the 
applied forces on an object that the student mentioned: 


Example 12 
Student: (String1 + String2)/g = mass of plane. 


Tutor: It would be (F_T1_y + F_T2_y)/g = mass, OK? 


Discourse Relations That Predict Learning Among 
Low Pretest Students 


Correlations for low pretest students (N = 9) are displayed in 
Table 6. We again focus our discussion on significant findings 
(p ≤ .05) for all three types of gain. 

Table 6 
Correlations for Low-Pretest Students (N = 9) 


Abstraction level and discourse 
relations                            Mean           SD             R              p 

Overall gain 
  Abstraction level 
    T-S spec:gen                      4.44          3.005          .671           .048 
    S-T spec                          1.67          [illegible]    [illegible]    .029 
  Discourse relations 
    S-T situation:condition           0.67          1.000          .679           .044 
    [S-T elab(abstract:instance)]     1.89          1.833          .624           .072 
    [T-S member:set]                  0.78          1.093          [illegible]    [illegible] 
    [S-T elab(member:set)]            [illegible]   0.782          .646           .060 

Qualitative gain 
  Abstraction level 
    T-S spec:gen                      4.44          3.005          [illegible]    [illegible] 
    [spec:gen]                       16.00          5.050          .600           .088 
  Discourse relations 
    situation:condition               8.89          2.667          .676           .045 
    [S-T elab(abstract:instance)]     1.89          1.833          [illegible]    [illegible] 
    [S-T elab(step:process)]          2.22          1.481          .594           .092 

Quantitative gain 
  Abstraction level 
    spec                              [illegible]   [illegible]    [illegible]    [illegible] 
    [T-S gen:spec]                    [illegible]   [illegible]    .662           .052 
  Discourse relations 
    T-S object:attribute-direction    3.33          [illegible]    [illegible]    .047 
    S-T elab(set:member)              0.11          0.333          .884           .002** 
    S-T elab(whole:part)              1.44          1.590          [illegible]    [illegible] 
    S-T elab(gen:spec)                1.78          [illegible]    .680           .044 
    [object:attribute-direction]      4.56          [illegible]    [illegible]    [illegible] 
    [situation:condition]             8.89          2.667          [illegible]    .055 


Note. Trends are indicated by brackets. T = tutor; S = student; gen = 
general; spec = specific; elab = elaboration. 
**p ≤ .01. 


Overall gain. Student generalizations over the tutor’s turn 
positively correlated with low pretesters’ overall gain score; 
however, tutor specifications relative to the tutor’s turn negatively 
correlated with overall gain. Consistent with the findings from the 
set of students taken together, one discourse relation whose 
frequency predicted overall gain among low pretesters was S-T 
situation:condition, in which the student asks the tutor to explain 
the circumstances under which a given physical state (velocity 
decreasing in the y direction) applies: 


Example 13 
Student: Why is velocity decreasing in the y direction? 


Tutor: It starts out going up and gravity pulls it down. When accel- 
eration is opposed to velocity, the object slows down. 


Qualitative gain. Low pretesters’ abstraction over the tutors’ 
turns (T-S spec:gen) predicted qualitative gain score, consistent 
with a trend for abstraction either by the student or the tutor 
(spec:gen) to predict qualitative gain. Only one aggregate dis- 
course relation variable significantly predicted qualitative gain 
among low pretest students: situation:condition, which is the con- 
ditional relation in which the second speaker provides the condi- 
tions that explain the situation described by the first speaker, either 
because the first speaker solicited this information or the second 
speaker initiated it. Example 13 illustrated a student-solicited 
conditional relation. The following exchange shows a tutor 
prompting the student to specify a condition in a T-S 
situation:condition relation: 


Example 14 
Tutor: Why does the tension equal the weight in this problem? 


Student: Because there are no other outside forces acting on the 
bungee/jumper system. 


Encouraging low pretest students to explain their claims (e.g., 
tension = weight) appears to be beneficial and is under the 
tutoring system’s control, in contrast to student-initiated condition- 
als, such as the one shown in Example 13. 

Quantitative gain. Consistent with the findings for all stu- 
dents considered together, the frequency of exchanges in which 
both participants focused on the case at hand negatively correlated 
with quantitative gain among low pretest students. In addition, the 
frequency of one type of specification negatively predicted 
quantitative gain for this group: T-S object:attribute-direction 
relations, in which the tutor prompts the student to specify the 
direction of a value. Specifying the correct direction of a vector 
often requires conceptual understanding, so this negative correlation 
could reflect the difficulty that less-prepared students have in 
determining direction. However, the frequency of several other 
specification relations predicted quantitative gains for low 
pretesters, in particular tutor-initiated set:member [S-T 
elab(set:member)], whole:part [S-T elab(whole:part)], and gen:spec 
[S-T elab(gen:spec)] relations. The following exchange illustrates the 
tutor adding more specific information to the student’s dialogue turn, 
in an S-T elab(gen:spec) relation: 


Example 15 


Reflection question: Does gravity have any effect on the vertical 
motion of the firecracker? What about the horizontal motion? Explain 
your answers. 


Student: Vertical motion, yes; it makes it harder for the firecracker to 
travel away from the earth because gravity is pushing down, so it adds 
resistance. 


Tutor: Good, that is right (and it pulls the firecracker back down after 
the high point also). 


As this exchange illustrates, students sometimes answer ques- 
tions correctly but not completely. The tutor added information 
necessary to complete the student’s answer to the reflection ques- 
tion. Perhaps making low pretest students aware of complete 
answers, by adding to students’ dialogue contributions, increases 
these students’ quantitative problem-solving ability. 


Discourse Relations That Predict Learning Among 
High Pretest Students 


Correlations for high pretest students (N = 7) are displayed in 
Table 7. We again focus our discussion on significant findings 
(p ≤ .05) for all three types of gain. 

Overall gain. The frequency of only one discourse relation 
significantly predicted high pretest students’ overall gain score: 
S-T elab(whole:part), which was also observed for the group of 
students as a whole. As discussed previously, this relation typically 
occurs when the tutor specifies the components of a vector that the 
student named, the specific forces that comprise the net force, etc. 
This finding suggests that adding this level of precision to high 
pretesters’ dialogue contributions supports learning. 


Table 7 
Correlations for High Pretest Students (N = 7) 


Abstraction level and discourse 
relations                            Mean           SD             R              p 

Overall gain 
  Abstraction level: [S-T Gen]        3.00          1.826          −.700          .080 
  Discourse relations 
    S-T elab(whole:part)              0.43          0.787          [illegible]    .022 
    [object:attribute-units]          1.14          0.690          .709           .074 
    [object:attribute-direction]      3.43          [illegible]    [illegible]    [illegible] 
    [all-object-attribute]            5.43          [illegible]    [illegible]    [illegible] 
    [S-T elab(member:set)]            0.71          0.488          .719           .069 

Qualitative gain 
  Abstraction level 
    [T-S spec]                        2.29          1.704          .723           .066 
    [T-S spec:gen]                    4.86          2.134          [illegible]    .061 
    [Gen]                            10.43          5.224          −.688          .087 
  Discourse relations 
    S-T elab(term:definition)         0.29          0.756          [illegible]    [illegible] 
    object:attribute-units            1.14          0.690          [illegible]    [illegible] 
    S-T whole:part                    0.14          0.378          .863           .012 
    S-T elab(whole:part)              0.43          0.787          .809           .028 
    T-S elab(condition:situation)     0.29          0.756          [illegible]    [illegible] 
    [situation:condition]             9.57          3.359          .697           .082 

Quantitative gain 
  Abstraction level: [Gen:spec]      32.00          [illegible]    [illegible]    [illegible] 
  Discourse relations 
    T-S step:process                  1.00          1.000          [illegible]    .020 
    S-T elab                         18.00          6.325          [illegible]    .049 
    all-elab-bd                      19.00          [illegible]    [illegible]    .046 
    [S-T elab(instance:abstract)]     [illegible]   [illegible]    [illegible]    [illegible] 
    [all-abstract:instance-bd]        [illegible]   [illegible]    [illegible]    [illegible] 
    [step:process]                    [illegible]   1.113          .694           .084 


Note. Trends are indicated by brackets. S = student; T = tutor; Gen = 
general; elab = elaboration; bd = bidirectional. 


Qualitative gain. The frequency of several discourse relations 
predicted qualitative gains among high pretest students: tutor 
definitions of terms mentioned in the student’s dialogue move [S-T 
elab(term:definition)]; whole:part relations [S-T whole:part and 
S-T elab(whole:part)]; conditional relations that the student takes 
the initiative to complete [T-S elab(condition:situation)]; and one 
aggregate variable, object:attribute-units, in which the tutor 
prompts the student to provide missing units, or does this for the 
student. Tutor-initiated definitions typically occurred when the 
student used a term incorrectly and the tutor corrected it, as 
illustrated in the following exchange: 


Example 16 


Student: The force equals the mass of the book plus the other forces 
acting on it, which would be considered the acceleration. 


Tutor: Well ... the acceleration is the rate of change of its velocity. 


Perhaps giving high pretest students the definition of a misused 
term sometimes suffices to correct their knowledge. 

It is unclear why providing units (or prompting students to 
provide units) might support qualitative understanding. Perhaps 
units cement the difference between concepts or support students 
in understanding the temporal and spatial properties of physical 
concepts. 

Quantitative gain. The frequency of one discourse relation 
predicted quantitative learning among high pretest students:
exchanges in which the tutor provides a step in a line of reasoning
and prompts the student to provide the line of reasoning that 
follows from that step or that is necessary to get to that step. For 
example, in the following exchange, the tutor states the final step 
in the problem (tension = weight) and prompts the student to 
explain how she arrived at that conclusion, via a T-S step:process 
relation: 


Example 17 


Tutor: OK ... so then why does tension = weight ... show me how
you got your answer. 


Student: F = F_net = ma, a = 0, so mg = F_tension.


Two aggregate variables indicate that elaborations potentially 
hinder high pretesters’ ability to gain quantitative knowledge and 
skills: S-T elab and all-elab-bd. The latter includes all elaborations
initiated by either students or tutors; however, most were
issued by tutors (354 vs. 16). Perhaps filling in too many details in 
the line of reasoning hinders learning among more knowledgeable 
students; it might be better to let them fill in the gaps on their own, 
as indicated by prior research on textual coherence (e.g., McNa- 
mara, 2001; McNamara & Kintsch, 1996; McNamara, Kintsch, 
Songer, & Kintsch, 1996). 


Decision Rules to Guide Automated Scaffolding 


The analyses discussed in the previous section suggest that 
particular forms of cooperative execution that take place during 
scaffolding, implemented via co-constructed discourse relations, 
predict learning gains. However, since correlation does not imply 
causality, we need to determine if a tutorial dialogue system that is 
explicitly designed to encourage joint construction of these poten- 
tially beneficial discourse relations outperforms a counterpart tu- 
toring system not so designed. 

This section describes decision rules that stem from the findings 
discussed in the preceding section. These rules can guide the 
tutoring system in simulating these potentially effective aspects of 
human tutoring. Where appropriate, we provide further detail on 
the context in which these rules apply than we did in the previous 
section. In the next section, we illustrate how these decision rules 
are implemented in Rimac, in contrast to a control dialogue sys- 
tem. 

Rule 1. When the student provides a step in a line of reason- 
ing, the tutor may provide the missing steps of the line of reason- 
ing, rather than ask about each step individually. 

This decision rule stems from several correlations involving
the step:process relation. Specifically, for the group of students
taken as a whole, the frequency of S-T elab(step:process)
relations predicted overall gain, R(14) = .646, p = .007, and the
aggregate variable step:process predicted both overall gain and
qualitative gain, R(14) = .582, p = .018, and R(14) = .591, p =
.016, respectively. The tutor’s extension of the student’s line of
reasoning took place in three main contexts: (a) when the 
student answered a question correctly but not completely, as 
illustrated in Example 10; (b) when the student had some 
trouble coming up with a problem-solving or reasoning step, in 
which case the tutor filled in some of the line of reasoning and 
then prompted the student for additional steps; and (c) when the 
student reached the final step of a solution or line of reasoning, 
in which case the tutor summarized the steps leading up to that
conclusion. This mainly happened at the end of a problem. 
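
The correlations reported for these rules can be sanity-checked by recomputing each p value from R and its degrees of freedom. The sketch below assumes, because the section does not state it explicitly, that each R is a Pearson correlation tested two-tailed, with the number in parentheses being df = N − 2. Under that assumption, t = R * sqrt(df) / sqrt(1 − R^2), and p is the two-tailed area of Student's t distribution. The function names are illustrative, and Simpson's rule stands in for a statistics library.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(r: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p value for a correlation r with df degrees of freedom.

    Assumes |r| < 1 and an even number of integration steps.
    """
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)
    h = t / steps
    # Simpson's rule for the area under the density between 0 and t.
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area


print(round(two_tailed_p(0.646, 14), 3))
print(round(two_tailed_p(0.591, 14), 3))
```

Under these assumptions, R(14) = .646 and R(14) = .591 come out at roughly .007 and .016, in line with the values reported for Rule 1.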

Rule 2. If a student states a value but does not state how he
derived it, the tutor should prompt the student to explicate his 
reasoning process. 

This rule is similar to the preceding one, except that here the 
student, not the tutor, is expanding the line of reasoning as 
illustrated in Example 17. It stems from the finding that the 
frequency of T-S step:process relations predicted quantitative 
learning gains, particularly for high pretest students, R(5) =
.831, p = .020.

Rule 3. When students state vectors rather than vector com- 
ponents while solving equations, the tutor should provide the 
corresponding equation with components. Alternatively, the tutor 
should prompt the student to provide the vector components. 

This rule stems from several correlations involving the basic 
whole:part relation. For example, the frequency of S-T 
elab(whole:part) relations, in which the tutor specifies the vector 
components (Example 12), predicted overall gain for the whole 
group of students, R(14) = .524, p = .037. In addition, two 
aggregate variables predicted overall gain: whole:part and all- 
whole:part-bd, R(14) = .528, p = .035, and R(14) = .553, p = 
.026, respectively. Similar correlations were found for the group of 
high pretest students. 

Rule 4. When the student oversimplifies the circumstances 
under which a given physical situation applies or fails to make 
explicit the relationship between a narrower term and a broader 
term, the tutor should make these “member:set” relations explicit. 

This rule is based on the finding that the frequency of S-T
elab(member:set) relations predicted overall gain for all students
taken together, R(14) = .667, p = .005, for low pretest students,
R(7) = .646, p = .060, and for high pretest students, R(5) = .719,
p = .069. Example 11 illustrates a case in which the tutor states the
class in which a narrower concept belongs (e.g., gravity is a type 
of acceleration) when the student’s claim implies this but does not 
say it explicitly. 

The following exchange illustrates the tutor reacting to a stu- 
dent’s oversimplification of the circumstances associated with a 
physical situation. The student provides two examples of forces 
that could account for constant velocity (or a null net force); the 
tutor names the set “Anything else [other forces] that could make 
the net force 0”: 


Example 18 


Student: No acceleration for a constant velocity; this would only be 
possible for a situation with a great deal of air resistance or friction. 


Tutor: Or anything else to make the net force 0! The forces could be 
different. 


Rule 5. The tutor should ask “why” questions when the stu- 
dent does not provide an explanation to support a claim, especially 
with less knowledgeable students. 

This rule stems mainly from our finding that the frequency of 
conditional relations in which the tutor specified the conditions 
under which a situation described by the student applied (i.e., S-T 
situation:condition relations), correlated with overall learning 
gains for the group of low pretest students, R(7) = .679, p = .044. 
The aggregate variable situation:condition also predicted qualitative
gains for this group, R(7) = .676, p = .045. This finding
supports Louwerse et al.’s (2008) suggestion that prompting stu- 
dents to express conditional relations exposes gaps in their rea- 
soning process that the tutor can address, and this exercise pro- 
motes learning. 

Example 13 illustrates a case in which a student takes initiative 
and asks the tutor to state the conditions that explain a given 
situation, while Example 14 illustrates the more readily imple- 
mented case of the tutor prompting the student to state relevant 
conditions, via a T-S situation:condition relation. “Why” prompts 
such as this typically occur when the student answers a question 
correctly but does not justify his answer, as in the following 
exchange: 


Example 19 


Reflection question: Does average acceleration imply that the accel- 
eration is the same at every instant? 


Student: No. 
Tutor: Correct—could you say why? 


Student: Because average is taking different velocities over different 
times. 


Rule 6. If the student answers a question incorrectly, if possible
show why it is incorrect by stating the conditions under which
it would be correct.

This rule is related to the preceding and is mainly motivated by 
the correlation between the frequency of the aggregate situation: 
circumstance relation and qualitative gains among low pretest 
students. It reflects cases in which a student states a situation (the 
consequent in a conditional relation) and the tutor provides the 
conditions (antecedent) that would hold true if the situation were 
true. For example, in the following dialogue excerpt, the tutor 
states the conditions that would explain a net force of 0 on a 
bungee jumper: 


Example 20 


Reflection question: What minimum acceleration (in magnitude) must 
the jumper have in order for the cord not to break while he is on his 
way down? 


Student: 700 N/mass = a. 


Tutor: Not quite, good start. What is the “net” force on him? (in terms
of the tension and mg)? 


Student: The net force is 0. 


Tutor: Ah, OK. When he is hanging there, it is 0, or if he is moving 
with constant velocity. 


Rule 7. If the student gives a partially correct answer, the
tutor should complete it, especially for less knowledgeable stu- 
dents. 

This rule is based on the finding that the frequency with which the
tutor extends a partial or underspecified statement in the student’s
dialogue turn, via S-T elab(gen:spec) relations, correlated with
quantitative gains among low pretest students. Example 15
demonstrates a tutor’s application of this rule.

[Figure 1: dialogue graph. Tutor prompts shown in the figure: “What are the forces exerted on the egg after the clown releases it? Please specify their directions.”; “Yes. The direction is vertically down. Are there any other forces on the egg?”; “What force is applied because the egg is near the earth?”; “What is the direction of the force of gravity?”; “Now, what is the net force on the egg?” Student responses labeling the arcs include “gravity down,” “gravity,” “contact forces,” and “vertical down.”]

Figure 1. The dialogue paths of three students as they traverse the arcs in a knowledge construction dialogue
(KCD). Adapted from “Tools for Authoring a Dialogue Agent That Participates in Learning Studies,” by P. W.
Jordan, B. Hall, M. Ringenberg, Y. Cui, and C. P. Rosé, in R. Luckin, K. R. Koedinger, & J. E. Greer (Eds.),
2007, AIED 2007: Proceedings of the 13th International Conference on Artificial Intelligence in Education, Los
Angeles, CA (p. 48). Copyright 2007 by IOS Press, Amsterdam, the Netherlands. Adapted with permission.


Rule 8. When the student uses a term incorrectly, give the 
definition of the term to help the student correct his or her mistake. 

This rule stems from the finding that the frequency of S-T 
elab(term:definition) relations, in which the tutor defined a term 
that the student stated incorrectly or misapplied, correlated with 
qualitative gains, particularly among high pretest students, R(5) = 
.863, p = .012. Example 16 illustrates this rule. 

Rule 9. The tutor should ask for missing units or prompt the 
student to provide them, especially when a student is performing 
well—for example, when the student is close to solving a problem 
or answering a qualitative question. 

This rule is based on the finding that the frequency of the 
aggregate variable object:attribute-units, which includes all ex- 
changes in which the student presented a value without units and 
the tutor either provided these units or prompted the student to do 
so, correlated with qualitative learning among high pretest stu- 
dents, R(5) = .817, p = .025. In the following exchange, the tutor 
provides the missing units: 


Example 21


Student: T − mg = ma; 500 − 539 = 55a.


Tutor: Good deal. (I would add units there by the way: 500N − 539N =
55 kg*a.)


This rule is supported by prior research that used automated
machine learning methods to determine when abstraction and
specification take place during reflective dialogues
(Lipschultz, Litman, Jordan, & Katz, 2011). This research
found that tutors tend to abstract over the student’s dialogue
contribution early in a reflective dialogue, when students are
having difficulty responding to the tutoring system’s reflection
question. These abstractions appear to be aimed at ensuring that
the student understands the basic concepts needed to answer the
automated tutor’s question. Then, as the dialogue progresses
and the student gets closer to answering the reflection question
correctly, specification becomes more frequent than abstraction,
as tutors probe students for precision, for example, to
specify units and direction for a vector quantity, when the 
student only states its magnitude. 
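
One way to picture how Rules 1-9 could drive an automated tutor is as a set of independent condition checks over features of the student's latest turn. The sketch below is only an illustration: `StudentTurn`, its fields, and `applicable_rules` are hypothetical names, the features are assumed to have been detected already by some upstream language-understanding step, and nothing here reflects how Rimac actually encodes the rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudentTurn:
    """Hypothetical, already-extracted features of one student dialogue turn."""
    correctness: str = "correct"             # "correct", "partial", or "incorrect"
    justified: bool = True                   # did the student explain the claim?
    gave_reasoning_step: bool = False        # offered one step in a line of reasoning
    value_without_derivation: bool = False   # stated a value but not how it was derived
    vector_without_components: bool = False  # used a vector where components are needed
    oversimplified: bool = False             # left a member:set relation implicit
    misused_term: bool = False               # used a term incorrectly
    missing_units: bool = False              # gave a value without units


def applicable_rules(turn: StudentTurn) -> List[int]:
    """Return the numbers of the decision rules whose conditions the turn meets."""
    checks = [
        (1, turn.gave_reasoning_step),
        (2, turn.value_without_derivation),
        (3, turn.vector_without_components),
        (4, turn.oversimplified),
        (5, turn.correctness == "correct" and not turn.justified),
        (6, turn.correctness == "incorrect"),
        (7, turn.correctness == "partial"),
        (8, turn.misused_term),
        (9, turn.missing_units),
    ]
    return [rule for rule, applies in checks if applies]


# A correct but unjustified answer triggers the "why" prompt of Rule 5.
print(applicable_rules(StudentTurn(justified=False)))
```

A fuller version would also weight the rules by student knowledge, since Rules 5 and 7 are stated as applying especially to less knowledgeable students and Rule 9 especially to students who are performing well.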


An Example: Implementing Decision Rules in Rimac 


Reflective dialogues in Rimac were developed using a well- 
known framework called knowledge construction dialogues 
(KCDs). Several NL tutoring systems that have been found to be 
more effective than noninteractive forms of instruction (e.g., read- 
ing a textbook) have been developed using the KCD framework 
(e.g., Aleven, Popescu, Ogan, & Koedinger, 2003; Connelly & 
Katz, 2009; Graesser et al., 2003; Graesser, Person, Harter, & the 
Tutoring Research Group, 2001; Jordan, Ringenberg, & Hall, 


Table 8
Example of a Rimac Problem and Reflection Question


A modern supertanker is gigantic: 1,200-1,300 ft. long with a 200-ft.
beam. Fully loaded, it chugs along at about 16.0 knots (8.33 m/s). It
can take 1,200 s to bring it to a full stop. Calculate both the
acceleration in m/s² and the displacement of the tanker using this
stopping time.

Acceleration:
Displacement:


Reflection question: Is the acceleration pointing in the same direction
or in the opposite direction of the supertanker’s motion?


Note. This example problem is part of the Andes Physics Tutor system,
which was developed at Arizona State University and the University of
Pittsburgh with support from the Pittsburgh Science of Learning Center,
National Science Foundation Award SBE-0836012, and Office of Naval
Research Grant N00014-96-1-0260 and is available at
http://www.andestutor.org


Table 9
Application of Rule 5 in an Experimental Knowledge Construction Dialogue (KCD) but Not in the Corresponding Control KCD


Experimental KCD


Tutor: In the problem you’ve just solved, is the acceleration pointing in
the same direction or in the opposite direction of the supertanker’s
motion?

Student: Opposite.

Tutor: So why did you think this?

Student: Because when an object is slowing down, acceleration is
opposite to its motion.

Tutor: Very good. Now let’s step through the reasoning.


Note. Bold typeface indicates application of the rule.


2006; Katz et al., 2007; Rosé et al., 2001). During a KCD, the 
automated tutor presents a series of carefully ordered questions to 
the student, known as a directed line of reasoning (DLR; Evens & 
Michael, 2006). If a student answers a question correctly, the 
student advances to the next question in the DLR. Otherwise, the 
system launches a remedial subdialogue and then returns to 
the main DLR after the remedial subdialogue has completed. This 
process is illustrated in Figure 1. 
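
The control flow just described can be sketched as a recursive walk over a list of questions, each of which may carry its own remedial subdialogue. This is a minimal sketch under stated assumptions: `Step` and `run_kcd` are hypothetical names, the keyword test is a stand-in for whatever natural-language matching the real system performs, and the prompts are the ones visible in Figure 1.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Step:
    """One tutor question in a line of reasoning (hypothetical structure)."""
    prompt: str
    keywords: List[str]                       # a reply must contain all of these
    remediation: List["Step"] = field(default_factory=list)


def run_kcd(main_line: List[Step], replies: Iterable[str]) -> List[str]:
    """Walk the directed line of reasoning and return the prompts asked, in order."""
    reply_iter = iter(replies)
    asked: List[str] = []

    def walk(line: List[Step]) -> None:
        for step in line:
            asked.append(step.prompt)
            reply = next(reply_iter).lower()
            if not all(word in reply for word in step.keywords):
                # Incorrect answer: run the remedial subdialogue, then resume.
                walk(step.remediation)

    walk(main_line)
    return asked


egg_kcd = [
    Step("What are the forces exerted on the egg after the clown releases it? "
         "Please specify their directions.",
         ["gravity", "down"],
         remediation=[
             Step("What force is applied because the egg is near the earth?", ["gravity"]),
             Step("What is the direction of the force of gravity?", ["down"]),
         ]),
    Step("Now, what is the net force on the egg?", ["gravity"]),
]

# A student who misses the first question is routed through both remedial prompts.
for prompt in run_kcd(egg_kcd, ["contact forces", "gravity", "vertical down", "gravity"]):
    print(prompt)
```

A student who answers "gravity down" to the first question sees only the two main-line prompts. The sketch omits hints and the practice of stating the answer after repeated unsuccessful tries.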

KCDs in Rimac were implemented using TuTalk, a NL- 
dialogue-authoring toolkit (Jordan et al., 2006; Jordan, Hall, 
Ringenberg, Cui, & Rosé, 2007). TuTalk enables domain experts 
to construct NL tutoring systems without programming. Instead, 
they can focus on defining the tutoring content and structure of 
KCDs. 

From a research perspective, the main advantage of using 
KCDs is that the content and structure of KCDs are determined 
a priori by the dialogue developer, so different versions of a 
given KCD can be designed to test a hypothesis. Since our goal 
was to determine if the decision rules that we specified to guide 
simulation of cooperative execution during scaffolding enhance 
learning, we developed two versions of each Rimac KCD: one 
version that implements these rules in appropriate contexts and 
another that simulates the standard KCD practice of the tutor 
eliciting information from the student, hinting when possible, 
and stating the answer after the student has made one or two 
unsuccessful tries. 

We illustrate these two versions of a Rimac KCD with respect 
to the problem and reflection question shown in Table 8. Dialogue 
excerpts illustrate implementation of three of the decision rules 
described in the preceding section in the experimental version of 
the dialogue. 


Table 9 (continued)


Control KCD


Tutor: In the problem you’ve just solved, is the acceleration pointing in
the same direction or in the opposite direction of the supertanker’s
motion?

Student: Opposite.

Tutor: Very good. Now let’s step through the reasoning.


In the dialogue excerpt shown in Table 9, the decision-rule-driven
KCD applies Rule 5 because the student answered the
question correctly but without justifying it: The tutor should ask
“why” questions when the student does not provide an explanation
to support a claim, especially for less knowledgeable students. In
contrast, the standard KCD excerpt just gives the student positive
feedback. In both cases, the dialogue steps through the line of
reasoning that leads to the correct answer because even students
who answer “why” questions correctly sometimes do so by rote.

An intermediate step of the line of reasoning is shown in Table
10. In this case, the experimental (decision-rule-driven) KCD
applies Rule 6: If the student answers a question incorrectly, if
possible show why it is incorrect by stating the conditions under
which it would be correct. In contrast, the standard KCD simply
corrects the student without acknowledging or addressing the
student’s response. Reformulating students’ dialogue contributions,
whether correct or incorrect, is a common practice during
human tutoring, but its effectiveness has not been directly tested.
We implemented specific procedures within TuTalk to simulate
reformulation in Rimac (Jordan, Katz, Albacete, Ford, & Wilson,
2012).

At the final step of the line of reasoning, the decision-rule-driven
KCD summarizes the main physics principle addressed
in this reflection question (see Table 8) and recaps the line of
reasoning that supports this principle, thereby applying Rule 1. In
contrast, the standard KCD only states the general principle, as
shown in Table 11.

These alternative versions of Rimac are currently undergoing
pilot evaluation during the kinematics and dynamics units in high
school physics classes. Several urban and suburban schools in the
Pittsburgh, PA, area are participating in this pilot study. We predict


Table 10
Application of Rule 6 in an Experimental Knowledge Construction Dialogue (KCD) but Not in the Corresponding Control KCD


Experimental KCD


Tutor: When an object is slowing down, how does the final velocity (vf)
compare to the initial velocity (vi) for any interval of time? (smaller,
larger, other)

Student: Larger

Tutor: If the object is speeding up, then its final velocity is larger
than its initial velocity. But when an object is slowing down, its
velocity is getting smaller all the time. So for any interval of time, the
final velocity is smaller than the initial velocity.


Control KCD


Tutor: When an object is slowing down, how does the final velocity (vf)
compare to the initial velocity (vi) for any interval of time? (smaller,
larger, other)

Student: Larger

Tutor: When an object is slowing down, its velocity is getting smaller
all the time. So for any interval of time, the final velocity is smaller
than the initial velocity.


Note. Bold typeface indicates application of the rule.


Table 11
Application of Rule 1 in an Experimental Knowledge Construction Dialogue (KCD) but Not in the Corresponding Control KCD


Experimental KCD


Tutor: Hence from the definition of acceleration, how do the direction of
the acceleration and the direction of the change in velocity relate?
(same, opposite, no relation, it depends)

Student: They are the same.

Tutor: Very good. Hence, in the current problem, the direction of the
acceleration opposes the direction of motion of the supertanker
because the direction of the change in velocity opposes the motion
of the supertanker. And this happens because the final velocity is
smaller than the initial velocity, and velocity has the same
direction as motion. Therefore, in general, we can say that when an
object is slowing down, the direction of the acceleration opposes the
motion of the object.


Control KCD


Tutor: Hence from the definition of acceleration, how do the direction of
the acceleration and the direction of the change in velocity relate?
(same, opposite, no relation, it depends)

Student: They are the same.

Tutor: Very good. In general, when an object is slowing down, the
direction of the acceleration opposes the motion of the object.


Note. Bold typeface indicates application of the rule.


that the decision-rule-driven version will outperform the less in- 
teractive control and that the effect will be greatest for less pre- 
pared students, a common finding for evaluations of instructional 
interventions (VanLehn et al., 2007). 


Conclusion 


The holy grail of tutoring research is to identify specific features 
of human tutorial dialogue that account for its remarkable effec- 
tiveness (e.g., Bloom, 1984; Cohen, Kulik, & Kulik, 1982), so that 
these features can be simulated in NL tutoring systems. Although 
the interaction hypothesis posits that more interactive tutoring will 
result in more learning, research to test this hypothesis shows that 
constructs like interactivity and cooperative execution are too 
vague to guide automated tutoring and, in particular, the scaffold- 
ing that takes place when students are having difficulty solving a 
quantitative problem or answering a conceptual question. In order 
to operationalize interactivity and cooperative execution, we need 
to identify the linguistic mechanisms that implement these con- 
structs during human one-on-one tutoring and determine which 
mechanisms enhance learning. This knowledge can then be used to 
formulate decision rules that can be implemented and tested within 
NL tutoring systems. The research described in this article takes a 
step in this direction. 

Overall, this study supports the interaction hypothesis. Our 
analyses suggest that the effectiveness of human tutoring might 
very well lie in the language of tutoring itself—in particular, in the 
types of discourse relations that students and tutors co-construct 
during tutorial dialogues. Moreover, the types of co-constructed 
discourse relations that predict learning seem to vary according to 
students’ ability levels. However, given the small sample size,
these findings should be cross-validated by analyses of dialogue
corpora involving a larger number of subjects (both students and
tutors).

A second limitation of this work stems from its focus on
co-constructed discourse relations. It might well be the case that
some discourse relations are better “told” than “elicited,” that is,
conveyed through direct, didactic explanations instead of
co-constructed while questioning the student. For example, we were
surprised that we did not find a relationship between the frequency
with which a tutor stated abstract principles or formulae (e.g., the
equation for Newton’s second law) and prompted students to
instantiate these principles, as captured by the T-S
abstract:instance relation, and student learning. However, this does not
negate the potential effectiveness of instantiation of variables,
principles, and so on during tutoring. Perhaps the didactic form of this
relation (abstract:instance) does support learning among some
groups of students, but our analyses did not investigate correlations
between didactically delivered discourse relations and learning.
Hence, one goal of our future work will be to compare the
effectiveness of didactic and interactive forms of particular
discourse relations.

A third limitation of this research is that we did not consider
variations in the way that co-construction of discourse relations is
carried out and how these variations might impact learning. For
example, we observed that there are two main ways in which tutors
address abstractions. Tutors either anchor discussions about
concepts and principles in the case at hand (i.e., the current problem)
or address these abstractions in context-independent terms. For
example, in both dialogue excerpts shown in Table 12, the tutor
addresses the conditional: if an object travels upward and comes
back down, its vertical displacement is 0. In the excerpt shown in
the left column, the tutor grounds this abstraction in the current
physical situation about a firecracker. He provides the antecedent
of the conditional (the “if clause”) and prompts the student for the
consequent (the “then clause”). In contrast, in another dialogue
about the same problem, shown in the right column of Table 12,
the tutor speaks in more general, context-independent terms; he
refers to “an object,” not to the firecracker. Future research should
examine which approach (if either) is better and for which types of
students.


Table 12
Alternative Ways of Prompting for a Conditional Relation

Context-specific prompt for a             Context-independent prompt to complete
conditional relation                      a conditional relation

Tutor: Picture in your mind’s eye ...     Tutor: Regardless of whether we call
firecracker goes up, and then comes       ground level y = 0 or y = 500, what is
down and lands on the ground. What is     the y component of the displacement for
the net vertical displacement for that    an object that goes up and then comes
whole process?                            back down to ground level?

Student: 0.                               Student: 0 meters.

One important lesson that automated approaches to identifying
decision rules to guide tutoring have taught us is that the “right”
pedagogical move in a given context can depend on many factors:
student characteristics, features of the problem under discussion,
features of the dialogue context, and so on. We might not even be
able to specify the relevant factors a priori. It is quite likely that we
will find that the decision rules suggested by our analyses are
underspecified and in need of refinement. Although most of these rules,
as stated, could apply to any scientific problem-solving domain,
their generalizability remains to be tested. A combination of
automated approaches and carefully controlled experimental studies
of “tuned” versions of these decision rules and others will bring
tutoring researchers closer to cracking the code of interactivity and,
as a result, to developing more effective tutoring systems.
This article describes a comprehensive approach to fully automated assessment of children’s oral reading
fluency (ORF), one of the most informative and frequently administered measures of children’s reading 
ability. Speech recognition and machine learning techniques are described that model the 3 components 
of oral reading fluency: word accuracy, reading rate, and expressiveness. These techniques are integrated 
into a computer program that produces estimates of these components during a child’s 1-min reading of 
a grade-level text. The ability of the program to produce accurate assessments was evaluated on a corpus 
of 783 one-min recordings of 313 students reading grade-leveled passages without assistance. Established 
standardized metrics of accuracy and rate (words correct per minute [WCPM]) and expressiveness 
(National Assessment of Educational Progress Expressiveness scale) were used to compare ORF 
estimates produced by expert human scorers and automatically generated ratings. Experimental results 
showed that the proposed techniques produced WCPM scores that were within 3-4 words of human 
scorers across students in different grade levels and schools. The results also showed that computer- 
generated ratings of expressive reading agreed with human raters better than the human raters agreed with 
each other. The results of the study indicate that computer-generated ORF assessments produce an 
accurate multidimensional estimate of children’s oral reading ability that approaches agreement among 
human scorers. The implications of these results for future research and near term benefits to teachers and 
students are discussed. 


Keywords: oral reading fluency, automated reading assessment, expressive reading, automatic speech
recognition


Reading assessments provide school districts and teachers with 
critical and timely information for identifying students who need 
immediate help; for making decisions about reading instruction; 
for monitoring individual students’ progress in response to instructional
interventions; for comparing different approaches to reading
instruction; and for reporting annual outcomes in classrooms, 
schools, school districts, and states. One of the most common tests 
administered to primary school students is oral reading fluency 
(ORF). Over 25 years of scientifically based reading research has 
established that fluency is a critical component of reading and that 
effective reading programs should include instruction in fluency 
(Fuchs, Fuchs, Hosp, & Jenkins, 2001; Kuhn & Stahl, 2000; 
National Reading Panel, 2000). Although ORF does not measure 
comprehension directly, there is substantial evidence that estimates 
of ORF predict future reading performance and correlate strongly 
with comprehension (Fuchs et al., 2001; Shinn, 1998). According 
to Wayman, Wallace, Wiley, Tichá, and Espin (2007), ORF is a
valid indicator of comprehension in early grades, though less so 
beyond Grade 4. Because ORF can be measured rather quickly 
(typically in 5-10 min) with good validity and reliability, it is 
widely used to screen individuals for reading problems and to 
measure reading progress over time. 

In this article, we present a comprehensive approach to assess- 
ing ORF accurately and automatically through the use of speech 
recognition and machine learning techniques. The approach is 
comprehensive because all three measures of ORF—accuracy, rate 
(combined into a words correct per minute [WCPM] score), and 
expressiveness—can be measured automatically and in real time, 
whereas expressiveness is rarely scored in real-world educational 
contexts. The ultimate goal of research leading to fully automatic
and comprehensive assessment of ORF is to provide an accurate, 
accessible, and low-cost alternative to human-administered assess- 
ments. Successful outcomes of research in this area would sub- 
stantially reduce the millions of hours that teachers spend each 
year assessing their students’ reading abilities, which is mandated 
by federal law in the United States. In addition, computer-based 
assessments of ORF could generate detailed records of individual 
students’ performance, including the digital recordings of each
reading session that could be reviewed by teachers, parents, and 
students, and analyzed automatically for detailed information 
about the student’s reading problems. Automatic administration of 
ORF will also enable collection of massive amounts of speech data 
that can be used to analyze and understand children’s development 
of reading skills; these data can also be used to improve the 
performance of the speech recognition technologies. 

We used a speech recognition system (Bolaños, 2012) specifi-
cally designed to process children’s read speech to produce a
word-level hypothesis of what the student read from a grade-level 
text during 1 min. From this hypothesis and the text passage, a 
WCPM score was computed reflecting the student’s reading ac- 
curacy and rate. In order to assess prosodic reading, we developed 
a series of lexical and prosodic features that were extracted from 
the student’s speech. These included analysis of the text syntax and 
its correlation with filled pauses and silence regions, syllable and 
word duration, pitch, and word co-occurrences, among other fea- 
tures described below. Machine learning classifiers were trained 
on these features, resulting in statistical models that were able to 
discriminate between different degrees of prosodic reading using 
the National Assessment of Educational Progress ORF Scale 
(NAEP; Daane, Campbell, Grigg, Goodman, & Oranje, 2005). A 
hierarchical classification scheme was used in order to assign 
1-min reading sessions to levels in the NAEP scale. 

The accuracy of these assessment methods was evaluated on 
approximately 13 hr of speech collected from the 313 first- 
through fourth-grade students who read grade-level text passages. 
WCPM scores as well as NAEP assessments generated by the 
system, FLuent Oral Reading Assessment (FLORA), were com- 
pared with those produced by at least two independent human 
judges. 

The remainder of the article is organized as follows: The next 
section provides the scientific rationale for assessing ORF. We 
then describe the corpus of children’s read speech that was col- 
lected for this study. We then describe the system and features 
used to assess WCPM (accuracy and rate) and expressive reading 
using lexical and prosodic features extracted from the speech. The 
last section presents the discussion and conclusions. 


Scientific Rationale for FLORA 


ORF 


ORF is typically defined as a student’s ability to read words in 
grade-level texts accurately and effortlessly, at a natural speech 
rate and with appropriate prosodic expression. A synthesis of 
scientifically based reading research by the National Reading 
Panel (2000) concluded that 


Reading fluency is one of several critical factors necessary for reading 
comprehension, but it is often neglected in the classroom. If children 
read out loud with speed, accuracy and proper expression, they are
more likely to comprehend and remember the material than if they 
read with difficulty and in an inefficient way. 


Accuracy and automaticity. Accurate reading speed is both a 
strong discriminator of reading ability (e.g., Jenkins, Fuchs, van 
den Broek, Espin, & Deno, 2003; Perfetti, 1985) and a strong 
predictor of later reading proficiency (Lesgold & Resnick, 1982; 
Scarborough, 1998; see review by Compton & Carlisle, 1994). As
Jenkins et al. (2003) put it: “Together with listening comprehen- 
sion, word-reading skill accounts for nearly all of the reliable 
variance in reading ability, and individual differences in word 
recognition explain significant variance in reading ability, even 
after controlling for reading comprehension” (Curtis, 1980; 
Hoover & Gough, 1990). 

ORF depends on the ability to recognize words in a text 
quickly and automatically. As defined by Fuchs et al. (2001), 
automaticity is “the oral translation of text with speed and 
accuracy.” Automaticity theory (LaBerge & Samuels, 1974; 
Samuels, 1985; Wolf, 1999) and related verbal efficiency ac- 
counts of reading (Perfetti, 1985) hold that students who have 
learned to decode printed words automatically are able to 
devote more attention (cognitive resources) to comprehending 
what they are reading. Readers who have not achieved automa- 
ticity during word recognition must devote significant attention 
to recognizing words (at the expense of devoting this attention 
to making sense of the text), resulting in slower reading times 
and weaker comprehension. Support for automaticity and verbal 
efficiency theories of reading is provided by the strong associ- 
ation between the speed of reading words, either in word lists or 
in context, and measures of reading comprehension. 

Expressiveness. Although readers who have achieved flu- 
ency can read texts rapidly and accurately, they may not read 
expressively (i.e., they may not pause between sentences, at
major phrase boundaries within sentences, or produce appro- 
priate prosody when reading out loud). Expressive reading is 
the third critical component of reading fluency, typically de- 
fined as reading a text with the appropriate expression, intona- 
tion, and phrasing in order to preserve meaning (Miller & 
Schwanenflugel, 2008). 

Connection between ORF and comprehension. For over 25 
years, researchers have documented the association between 
reading fluency and comprehension. Reviews of the research on 
ORF have demonstrated consistently moderate to strong corre- 
lations between ORF and comprehension (Marston, 1989; 
Shinn, 1998). Research results have demonstrated high concur- 
rent validity between ORF and measures of word recognition 
and reading comprehension (Hosp & Fuchs, 2005; Jenkins et 
al., 2003), and between ORF and nationally normed standard- 
ized tests of reading comprehension (Roehrig, Petscher, Nettles, 
Hudson, & Torgesen, 2008; Schilling, Carlisle, Scott, & Zeng, 
2007; Schwanenflugel et al., 2006). Measures of ORF in early 
grades have also been found to predict comprehension in later 
grades (Kim, Petscher, Schatschneider, & Foorman, 2010). 
Thus, the relation between ORF and reading comprehension has 
been well established by previous research, particularly for 
students in elementary school (Kim et al., 2010; Roberts, Good, 
& Corcoran, 2005; Roehrig et al., 2008). 


Previous Work Using Automatic Speech Recognition
to Assess and Improve ORF


Automatic assessment of reading accuracy and rate. Over 
two decades of research has investigated the use of automatic 
speech recognition (ASR) to assess and improve reading. Seminal 
research conducted by Jack Mostow and his colleagues in Project 
Listen at Carnegie Mellon University has demonstrated the effec- 
tiveness of ASR for improving reading fluency and comprehension 
for both native and nonnative speakers of English (Mostow et al., 
2003; Reeder, Shapiro, & Wakefield, 2007). Mostow et al. (2003) 
used an ASR system to measure a student’s interword latency, 
defined as the elapsed time between certain words read aloud by 
the student that were scored as correctly read by the ASR system. 
Their model of interword latency produced a correlation of over .7 
with independent WCPM measures of ORF using grade-level 
passages. 

In the context of Project Tball (Technology Based Assessment 
of Language and Literacy) at the University of California, Los 
Angeles and University of Southern California, Black, Tepperman, 
Lee, and Narayanan (2008) investigated oral reading of 55 isolated 
words produced by kindergarten, first-, and second-grade children 
with the aim of detecting reading miscues automatically, such as 
sounding-out, hesitations, whispering, elongated onsets, and ques- 
tion intonations. Black et al. developed an ASR system that used 
specialized grammars to model word-level disfluencies using the 
subword-modeling approach developed by Hagen and Pellom 
(2005). Scores produced by the recognition system correlated 
highly (.91) with fluency judgments provided by human listeners. 

A series of studies by Bryan Pellom and Andreas Hagen and 
their collaborators (Hagen, Pellom, & Cole, 2007) investigated 
ways to optimize an ASR system for children’s read speech. The 
research resulted in a reduction in the word error rate from 17.4% 
to 7.6%. Hagen et al. (2007) developed a version of the ASR 
system that used subword-modeling rather than whole-word scor- 
ing to detect reading errors. In the study, several subword lexical 
units and approaches were evaluated for detection of reading 
disfluencies, and modest gains were reported. Bolaños (2008)
reported that additional detection gains were achieved by using 
syllable graphs to represent hypotheses from the ASR system. 

Automatic assessment of expressive oral reading. Although 
the National Reading Panel (2000) and research community define 
ORF in terms of word recognition accuracy, reading rate, and how 
expressively the student reads (see Kuhn, Schwanenflugel, & 
Meisinger, 2010, for a discussion of this topic), expressiveness is 
rarely measured in assessments of ORF. Only recently has the 
expressiveness aspect of the reading fluency construct found its 
way into automated assessments of fluency. Duong, Mostow, and 
Sitaram (2011) investigated two alternative methods of measuring 
prosody during children’s oral reading. The first method, which 
was text dependent, consisted of generating a prosodic template 
model for each sentence in the text. The template was based on 
word-level features like pitch, intensity, latency, and duration 
extracted from fluent adult narrations. The second method inves- 
tigated adult narrations to train a general duration model that could 
be used to generate expected prosodic contours of sentences for 
any text, so an adult reader was no longer required to generate 
sentence templates for each new text. Both methods were evalu-
ated for their ability to predict students’ scores on fluency and
comprehension tests, and each produced promising results, with
the second, automated method for generating prosodic sentence 
templates outperforming the system that compared children’s read 
speech with adult narrations of each individual sentence in the text. 
However, neither of these methods could satisfactorily classify 
sentences using the NAEP expressiveness rubric relative to human 
judgments, which was probably due to the low human interrater 
reliability reported in this study. 


Development of the FLORA System 


Development of a Corpus for Assessing ORF 


Data collection setting. Data were collected from 313 first- 
through fourth-grade students in four elementary schools (nine 
classrooms) in the Boulder Valley School District in Colorado. 
Data were collected from students in their classrooms at their 
schools. In School 1, 53.8% of students received free or reduced
lunches, and the school had the lowest literacy achievement scores of the three
schools on the Colorado state literacy test given to third-grade
students; 53% of third-grade students in School 1 scored proficient or
above on the state reading assessment. In School 2, 51.7% of stu-
dents received free or reduced lunch (similar to School 1), but 79% of
third-grade students tested as proficient or above on the state
literacy test. School 2 was a bilingual school with nearly 100%
English learners (ELs) who spoke Spanish as their first language.
In School 3, 18.4% of students received free or reduced lunch, and 85%
of students were proficient or above on the state literacy test.
School 3 also had relatively few ELs.

Text passages. Twenty text passages were available for read- 
ing at each grade level. The standardized text passages were 
downloaded from a website (Good & Kaminski, 2002) and are 
freely available for noncommercial use. The text passages were 
designed specifically to assess ORF and are about the same level 
of difficulty within each grade level. ORF norms have been col- 
lected for these text passages for tens of thousands of students at 
each grade level in fall, winter, and spring semesters, so that 
students can be assigned to percentiles based on national WCPM
scores (Hasbrouck & Tindal, 2006). 

Data collection protocol. The data were collected using the 
FLORA system (Bolaños, Cole, Ward, Borts, & Svirsky, 2011),
which was configured to enroll each student, randomly select one 
passage from the set of 20 standardized passages for the student’s 
grade level, and present the passage to the student for reading out 
loud. Because testing was conducted in May, near the end of the 
school year, classroom teachers had recently assessed their stu-
dents’ oral reading performance (using text passages different
from those used in our study). About 20% of the time, teachers 
requested that specific students be presented with text passages 
either one or two levels below or one or two levels above the 
student’s grade level. Thus, about 80% of students in each grade 
read passages at their grade level, whereas 20% of students read 
passages above or below their grade level, based on their teachers’ 
recommendations. Depending on the number of students who 
needed to be tested on a given day, each student was presented 
with two or three text passages to read aloud. 

During the testing procedure, the student was seated before a 
laptop and wore a set of headphones with an attached noise- 
cancelling microphone. The experimenter observed or helped the 
student enroll in the session, which involved entering the student’s
gender, age, and grade level. FLORA then presented a text passage,
started the 1-min recording at the instant the passage was
displayed, recorded the student’s speech, and relayed the speech to 
a server. 

Corpus summary. The corpus comprised 783 recordings 
from 313 first- through fourth-grade students for a total of approx- 
imately 13 hr of speech data. Each recording was scored manually 
by two human judges. Words were scored as reading errors if the 
word was skipped over, or the judge decided that the word was 
misread. Insertions of words (intrusions) were not scored as read- 
ing errors, as insertions were not counted as errors in the national 
norms collected by Hasbrouck and Tindal (2006).


Automatic Generation of WCPM Scores 


The number of words that a student read correctly during 1 min 
was computed automatically by ReadToMe, the reading tracker 
built on top of our ASR system (Bolaños, 2012). The computation
of the WCPM score was done as follows. (a) ReadToMe used the
Bavieca speech recognition toolkit (Bolaños, 2012) to produce a
word-level hypothesis representing what the student read. (b) 
ReadToMe aligned the hypothesis to the reference text (the words 
in the text passage the student read) and tagged each of the words 
in the reference text as correctly or incorrectly read or skipped 
over. (c) Finally, ReadToMe counted the number of words scored 
as correctly read during the 1-min reading; this number is the 
resulting WCPM score for the text passage. 
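
Steps (b) and (c) above can be illustrated with a minimal sketch. It assumes that a generic sequence alignment (Python's difflib.SequenceMatcher) is an acceptable stand-in for the alignment performed by ReadToMe, whose actual algorithm is not described here. The function names and the example sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (token.strip(".,;:!?\"'()") for token in text.lower().split())
    return [word for word in words if word]


def words_correct(reference_text, hypothesis_text):
    """Count reference words that the alignment matches to the hypothesis.

    Unmatched reference words are treated as misread or skipped. Extra
    hypothesis words (insertions) are ignored, following the scoring rule
    described for the corpus.
    """
    reference = normalize(reference_text)
    hypothesis = normalize(hypothesis_text)
    # Generic alignment; an assumption, not the ReadToMe implementation.
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


# Invented example: one repetition, one skipped word, one filler, one misreading.
reference = "The small dog ran to the park and played with a ball."
hypothesis = "the small dog ran ran to park and um played with a bell"

# For a 1-min recording, the count of correctly read words is the WCPM score.
wcpm = words_correct(reference, hypothesis)
print(wcpm)
```

In this toy case the repeated word and the filler do not lower the score, whereas the skipped and misread words do, which mirrors the treatment of insertions described for the corpus.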


Automatic Assessment of Expressive Reading 


In order to assess expressive reading automatically, we proposed 
a set of lexical and prosodic features that can be used to train a 
machine learning system to classify how expressively students 
read text passages aloud using the 4-point NAEP scale. The 
proposed features were designed to measure the speech behaviors 
associated with each of the four levels of fluency described in the 
NAEP rubric and were informed by research on acoustic-phonetic, 
lexical, and prosodic correlates of fluent and expressive reading 
described in the research literature (Kuhn et al., 2010). Features 
were extracted from multiple sources, including the recognition 
hypothesis, a pitch-extractor, and a syllabification tool. Features 
included the WCPM score itself, the speaking rate, sentence read- 
ing rate, number of word repetitions, location of the pitch accent, 
word and syllable durations, and filled and unfilled pauses and 
their correlation to punctuation marks in the text passage. A 
detailed description, motivation, and analysis of all the features 
proposed and used for the study can be found in Bolaños et al.
(2013). 

Classification method. In order to classify the 783 one-min 
recordings using the features proposed, we used a powerful clas-
sification technique called support vector machines (Vapnik,
1995). We experimented with different classification strategies
and found a strategy based on a decision directed acyclic graph 
(DAG) to be most successful (Platt, Cristianini, & Shawe-Taylor, 
2000). The DAG approach makes sense conceptually because it 
maps directly to the NAEP scale; that is, it distinguishes disfluent 
reading (Levels 1 and 2 in the NAEP scale) from fluent reading 
(Levels 3 and 4 in the NAEP scale) and then makes finer distinc-
tions (1 vs. 2 and 3 vs. 4). To implement the DAG strategy, we
trained three classifiers. The first classifier was trained on samples
from all classes and separated Classes 1 and 2 from Classes 3
and 4. This classifier was placed at the root of the tree, whereas
two other classifiers, trained on samples from Classes 1 and 2 and
Classes 3 and 4, respectively, were placed on the leaves to make the
finer-grained decisions. A detailed description of the classification
scheme can be found in Bolaños et al. (2013).
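
The routing logic of the three-classifier scheme can be sketched as follows. This is only an illustration of the decision structure: the three trained support vector machines are replaced by toy threshold rules, and the feature names and cutoffs are invented for the example.

```python
def classify_naep(features, is_fluent, is_level_2, is_level_4):
    """Route one recording through the two-level decision structure.

    is_fluent(features)   -> True for fluent reading (NAEP Levels 3 and 4)
    is_level_2(features)  -> True for Level 2, False for Level 1
    is_level_4(features)  -> True for Level 4, False for Level 3
    """
    if is_fluent(features):
        return 4 if is_level_4(features) else 3
    return 2 if is_level_2(features) else 1


# Toy threshold rules standing in for the three trained classifiers.
# The feature names and cutoffs are invented for illustration only.
def is_fluent(f):
    return f["wcpm"] >= 90


def is_level_2(f):
    return f["wcpm"] >= 50


def is_level_4(f):
    return f["pause_punctuation_match"] >= 0.8


recordings = [
    {"wcpm": 30, "pause_punctuation_match": 0.1},
    {"wcpm": 70, "pause_punctuation_match": 0.3},
    {"wcpm": 110, "pause_punctuation_match": 0.5},
    {"wcpm": 140, "pause_punctuation_match": 0.9},
]
levels = [classify_naep(f, is_fluent, is_level_2, is_level_4) for f in recordings]
print(levels)
```

Each recording is evaluated by exactly two classifiers: the root decision and one of the two finer-grained decisions.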


Speech Recognition System 


A total of 106 hr of read speech from three different children’s
speech corpora were used to train the recognition system. The 
recognizer was not trained on the corpus of read speech, described 
above, that was used to evaluate FLORA. We note that the system 
is text independent; that is, for new text passages, the system 
automatically generates the expected pronunciation(s) of each 
word in a text passage from a pronunciation dictionary. 

The speech recognition system combines two main sources of 
information to produce a score for each word. These sources are 
(a) the score produced by matching the system’s acoustic models 
for the expected sequence of phonemes in a word (based on a 
pronunciation dictionary) to the student’s pronunciation of the 
word and (b) the probability of the word occurring in the text (the 
statistical language model, based on the co-occurrence of words in 
the text passage). These two sources of information are combined 
to produce the most likely hypothesis string given the speech input. 
Additionally, phone-level alignments from each of the 1-min re- 
cordings were generated automatically for feature extraction pur- 
poses. Two complementary speaker adaptation techniques were 
used in order to tailor the speaker-independent acoustic models to 
the speech characteristics and vocal tract length of each speaker. 


Comparison Between Automated and Human 
Assessments of ORF 


Human Scoring of Recorded Sessions 


In order to evaluate the ability of FLORA to produce reliable 
WCPM scores, each of the 783 one-min recordings in the evalu- 
ation corpus was scored independently by two former elementary 
school teachers. Each teacher had more than a decade of experi- 
ence administering reading assessments to elementary school chil- 
dren. The scorers were able to listen to, review, and modify their 
judgments within each recording until they were satisfied with 
their WCPM score. Thus, they were allowed to listen to the 
recording more than once. 

Additionally, each of the 783 recordings was scored from 1 to 4
using the NAEP ORF scale by at least two independent scorers,
who were former elementary school teachers with experience
assessing reading proficiency. A set of 70 of the 783
stories was scored by the five available teachers, whereas the
other recordings were each scored by just two of them, with recordings
randomly assigned to scorers. A training session was sched-
uled before the scoring process to review the NAEP scoring
instructions and unify criteria. The judges first listened to passages 
rated by two experienced researchers whose area of expertise is 
expressive reading (Paula Schwanenflugel and Melanie Kuhn). 
The teachers who scored the stories then rated these passages and
compared their ratings with the experts. The teachers then rated 
several additional passages and reviewed their ratings based on the 
definitions of each of the NAEP levels. The training was con- 
cluded when the teachers’ level of agreement approximated the 
agreement exhibited by the two experts. 

For the actual scoring of the evaluation corpus, the judges 
listened to each 60-s story in 20-s intervals and provided a 1-4
rating for each interval. The NAEP ORF scale (Daane et al., 2005) 
comprises four levels from less to more fluent. Level 1 is charac- 
terized by word-by-word reading, Level 2 by reading using two- 
word phrases with some three- or four-word groupings, and Level 
3 is characterized by a majority of three- or four-word phrase 
groups while preserving the syntax of the author. Readers at Level 
4 produce larger, meaningful phrase groups with expressive inter- 
pretation. Finally, scorers attached a global NAEP score to the 
recording based on the NAEP scores assigned to each 20-s seg- 
ment. The global score was based on a review of the scores and 
their best judgment rather than using a deterministic method like 
the mean or mode. 


Assessment of Reading Accuracy and Automaticity 


Table 1 shows the means and standard deviations (between 
parentheses) for accuracy, words per minute (WPM), and WCPM 
scores for the human scorers and FLORA. Statistics are shown per 
reading level for students in the four schools. As noted above, 
although the evaluation data were collected from students from 
Grades 1 to 4, about 20% of the time, teachers requested that 
specific students be presented with text passages either one or two 
levels below or above the student’s grade level, resulting in read-
ing levels for text passages from Grades 1 to 6. In Table 1,
accuracy is expressed as a percentage; WPM measures
fluency from the perspective of speed, ignoring accuracy; and each
score is based on the average across the two human scorers for
each recording. It can be seen that accuracy (percentage of words
read correctly) is higher for higher grade levels, from 70.3% for
first grade to 92.6% and 90.5% for fifth- and sixth-grade levels,
respectively. WPM scores are displayed in Column 5 for each grade level;
as expected, they are highly correlated with WCPM scores measured by
human scorers (Column 6); however, WCPM scores computed by
FLORA (Column 7) are much closer to human WCPM scores
(Column 6) than WPM scores are.

A major result can be observed by comparing the WCPM scores 
from the human scorers and FLORA, which present very similar 
distributions (means and standard deviations). In addition, we
observed very similar distributions of WCPM scores from humans
and FLORA within each of the nine classrooms in which we
conducted the study, even for classrooms in schools in which the
majority of students spoke Spanish as their first language and
were officially designated as English learners. 

Column 8 shows the expected number of WCPM for each grade 
level according to Hasbrouck and Tindal (2006) reading norms. It 
can be seen in the table that students were assigned by teachers to 
reading levels at which they read around the 50th percentile. We 
believe that there is no credible evidence to link higher WCPM 
scores to improved comprehension, but there is substantial support 
for the need for readers to have an accuracy and rate (WCPM 
score) in the range of the 50th percentile to support both compre- 
hension and motivation. 

Another pattern of results is revealed by examining the numbers 
in Column 9, which shows the mean difference in WCPM scores 
for the two human scorers for the recordings in each classroom, and
the numbers in Column 10, which shows the mean difference 
between the averaged human scores and FLORA for each class- 
room. Note that differences in WCPM scores are expressed in 
absolute value. Viewing the numbers in Column 9 reveals the 
remarkable agreement between the two human scorers (1.2- 
WCPM difference across all schools) and the low variance. Across 
all recordings, the mean difference between FLORA and the 
averaged human scores was 3.6 words, whereas the mean differ- 
ence between human scores was 1.2 words. 
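
The two difference columns can be illustrated with a small sketch of the computation. The scores below are invented toy values, not data from the study, and the use of the sample standard deviation is an assumption because the text does not say which estimator was used.

```python
from statistics import mean, stdev


def abs_diff_summary(scores_a, scores_b):
    """Return the mean and standard deviation of absolute score differences."""
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return mean(diffs), stdev(diffs)


# Invented WCPM scores for four recordings (illustration only).
human_1 = [42, 88, 101, 110]
human_2 = [43, 86, 101, 112]
flora = [45, 84, 104, 106]

# Human-to-human difference (the quantity reported as H-diff).
h_diff, h_sd = abs_diff_summary(human_1, human_2)

# FLORA is compared with the average of the two human scores (FH-diff).
human_avg = [(a + b) / 2 for a, b in zip(human_1, human_2)]
fh_diff, fh_sd = abs_diff_summary(human_avg, flora)

print(h_diff, fh_diff)
```

Taking absolute values before averaging prevents overestimates and underestimates from cancelling each other out.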

Figure 1a displays a scatter plot of the WCPM scores from the 
two human scorers for all recordings, whereas Figure 1b displays 
a scatter plot of the WCPM scores from FLORA with respect to the
average human scores for all recordings. If agreement were per- 
fect, all points would lie on the diagonal. These figures show the 
strong agreement between WCPM scores for human scorers on 
each recording, and the very good agreement between FLORA and 
the human scores, with relatively few outliers. 

We were interested in determining whether FLORA might be a 
useful tool for providing WCPM scores that could be used as one 
valuable indicator, along with other measures, to identify students 
who are at risk for failing to learn to read. One way to do this is 
to compare human and FLORA WCPM scores with the national 
reading norms developed by Hasbrouck and Tindal (2006), which 
measured WCPM scores, for first- through sixth-grade students 
during each trimester of a school year. The interrater agreement in
the task of mapping recorded stories to percentiles was 0.97 for the
human scorers and 0.89 between FLORA and each of the human
scorers. The interrater agreement in the task of mapping recorded
stories above and below the 50th percentile (which is used nor-
mally as a reference to identify at-risk students) was 0.98 for the
human scorers and 0.92 between FLORA and each of the human
scorers. Agreement was computed using the weighted kappa co-
efficient (κ) (Cohen, 1968), which is suitable for ordinal catego-
ries. In sum, the interhuman agreement and the FLORA-to-human
agreement are very close, which means that FLORA performs well
at identifying students who might require additional reading as-
sessments and instruction.


Table 1
Summary of Accuracy, WPM, and WCPM According to Human Scorers (H) and FLORA (F). Expected WCPM (E) Are Also Shown

Level  Stu.         Rec.         Acc. (%)     H-WPM         H-WCPM        F-WCPM        E-WCPM  H-diff       FH-diff
1      68           171          70.3 (19.7)  54.6 (25.5)   41.9 (26.4)   42.5 (25.9)   53      1.2 (1.8)    [illegible]
2      97           242          84.6 (10.1)  [illegible]   85.7 (33.1)   86.1 (31.8)   89      1.2 (2.0)    3.8 (4.4)
3      52           128          87.3 (7.6)   113.4 (28.1)  100.1 (29.6)  101.6 (28.0)  107     1.2 (1.4)    3.6 (2.8)
4      59           147          87.4 (8.1)   124.4 (26.6)  109.9 (27.3)  [illegible]   123     1.3 (1.8)    4.1 (3.1)
5      30           76           92.6 (3.6)   156.9 (26.4)  145.6 (26.6)  145.6 (24.5)  139     [illegible]  4.6 (4.5)
6      [illegible]  [illegible]  90.5 (14.1)  145.9 (46.1)  137.3 (49.6)  137.4 (49.1)  150     1.5 (2.0)    2.8 (2.6)
All    [illegible]  783          83.3 (14.0)  103.3 (42.2)  90.1 (43.1)   91.1 (42.6)           1.2 (1.8)    3.6 (3.6)

Note. WPM = words per minute; WCPM = words correct per minute; FLORA = FLuent Oral Reading Assessment; Stu. = number of students; Rec.
= number of recordings; Acc. = accuracy; H-diff = difference between the human scorers; FH-diff = difference between FLORA and the human scorers.


Figure 1. Correlation between WCPM scores produced by two independent human scorers (a) and between
FLORA and the average of the two independent human scorers (b) for each of the 1-min recordings assessed.
WCPM = words correct per minute; FLORA = FLuent Oral Reading Assessment.
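
The screening decision discussed above, in which each recording is placed above or below the 50th percentile, can be sketched as follows. The sketch assumes that the expected WCPM values listed in Table 1 (E-WCPM) can serve as the 50th-percentile cutoffs for each reading level; the recordings and their scores are invented for illustration.

```python
# Expected WCPM per reading level, taken from the E-WCPM column of Table 1.
# Treating these values as 50th-percentile cutoffs is an assumption.
EXPECTED_WCPM = {1: 53, 2: 89, 3: 107, 4: 123, 5: 139, 6: 150}


def below_benchmark(level, wcpm):
    """Return True when a WCPM score falls below the cutoff for its level."""
    return wcpm < EXPECTED_WCPM[level]


# Invented recordings: (reading level, averaged human WCPM, FLORA WCPM).
recordings = [(1, 41.0, 44), (2, 92.5, 88), (3, 120.0, 118), (4, 110.0, 113)]

human_flags = [below_benchmark(level, human) for level, human, _ in recordings]
flora_flags = [below_benchmark(level, machine) for level, _, machine in recordings]

# Raw proportion of identical decisions (not the weighted kappa of the study).
agreement = sum(h == f for h, f in zip(human_flags, flora_flags)) / len(recordings)
print(human_flags, flora_flags, agreement)
```

Disagreements of this kind can arise only for scores close to the cutoff, as in the second invented recording, where a difference of a few words changes the decision.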


Assessment of Expressive Reading 


In this section, we present results on assessing expressive 
oral reading using FLORA. First, we briefly analyze the clas- 
sification accuracy for the lexical and prosodic features pro- 
posed in relation to human assessments. We then analyze agree- 
ment and correlation between human scores and FLORA’s 
automatic scoring system using the NAEP scale. 

Classification accuracy. In order to derive the most effective 
combination of features to assess expressive reading, we measured 
the classification accuracy (percentage of recordings that FLORA 
assigned the same label as the human labelers) of FLORA on the
corpus described above. Each recording was labeled by FLORA 
according to the NAEP scale, and labels were compared with those 
from all the available human labelers. We note that there exists an 
upper bound to the classification accuracy that can be attained by 
the classifier. The reason is that whenever the human raters score 
the same recording differently, there is an unrecoverable classifi- 
cation error. 
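
The upper bound mentioned above can be made concrete with a short sketch. It assumes that accuracy is computed over all pairs of a recording and a human label, which is one plausible reading of the comparison with all available labelers; the labels below are invented.

```python
from collections import Counter


def accuracy(machine, human):
    """Fraction of (recording, human label) pairs matched by the machine label."""
    pairs = [(machine[rec], label) for rec, labels in human.items() for label in labels]
    return sum(m == h for m, h in pairs) / len(pairs)


def accuracy_upper_bound(human):
    """Best accuracy any single label per recording could reach.

    A single label can agree with at most the most frequent human label,
    so every human disagreement costs at least one unrecoverable error.
    """
    total = sum(len(labels) for labels in human.values())
    best = sum(Counter(labels).most_common(1)[0][1] for labels in human.values())
    return best / total


# Invented NAEP labels from two human raters per recording.
human = {"rec1": [3, 3], "rec2": [2, 3], "rec3": [1, 1], "rec4": [4, 3]}
machine = {"rec1": 3, "rec2": 2, "rec3": 1, "rec4": 2}

acc = accuracy(machine, human)
bound = accuracy_upper_bound(human)
print(acc, bound)
```

In this toy case the human raters disagree on two of the four recordings, so no classifier could exceed 75% accuracy under this scoring.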

Results showed that both lexical and prosodic features contrib- 
uted similarly to the classification accuracy for the NAEP-2 (dis- 
fluent vs. fluent) task (89.27% and 89.02%, respectively). This can 
be initially considered an unexpected result because lexical aspects 
like the number of words read correctly are expected to dominate 
the discrimination between fluent and nonfluent readers. However,
it is important to note that some of the prosodic features defined in
this study are highly correlated with the lexical features. For
example, it is obvious that the number of words correctly read in
a 1-min reading session should correlate highly with the average
duration of a silence region or the number of filled pauses made. 

For both the NAEP-2 and NAEP-4 tasks, lexical and prosodic 
features provided complementary information that led to improved 
classification accuracy when combined. For the NAEP-4 task,
lexical features seem to have a dominant role (73.24% for lexical vs.
69.73% for prosodic features). We attribute this to the WCPM score,
which is taken as a lexical feature; this score by itself provides a 
71.78% accuracy for the NAEP-4 task. As expected, the automat- 
ically computed WCPM, which comprises two of the three reading 
fluency cornerstones (accuracy and rate), plays a fundamental role. 
In particular, the combination resulted in accuracies of 90.72% and 
75.87% for the NAEP-2 and NAEP-4 tasks, respectively. Finally, 
note that the distribution of recordings across the NAEP levels 
according to humans and machine was very similar. 


Interrater Agreement and Correlation 


In this section, we present interrater agreement and correlation 
results for the best system from the previous section (multilabel 
training using all the features). Table 2 shows the interrater agree- 
ment for the tasks of classifying recordings into the broad NAEP 
categories (fluent vs. nonfluent; NAEP-2), or the four levels of 
expressiveness using the NAEP rubric (NAEP-4). For the NAEP-2 
task, the interrater agreement was measured using Cohen’s kappa 
coefficient (κ) (Cohen, 1960); p(a) is the probability of observed
agreement, whereas p(e) is the probability of chance agreement. 
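
As a brief illustration of how the two probabilities combine, Cohen's kappa is (p(a) - p(e)) / (1 - p(e)). The sketch below computes it for two raters on the fluent versus nonfluent task; the ratings are invented for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Return p(a), p(e), and Cohen's kappa for two raters."""
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return p_a, p_e, (p_a - p_e) / (1 - p_e)


# Invented ratings for eight recordings (F = fluent, D = nonfluent).
rater_1 = ["F", "F", "D", "D", "F", "D", "F", "D"]
rater_2 = ["F", "F", "D", "F", "F", "D", "D", "D"]

p_a, p_e, kappa = cohens_kappa(rater_1, rater_2)
print(p_a, p_e, kappa)
```

Applied to a row of Table 2, the same formula with p(a) = 0.90 and p(e) = 0.50 gives a kappa of 0.80, which matches the value reported for Human 2.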

For the NAEP-4 task, we measured the interrater agreement 
using the weighted kappa coefficient (κ) (Cohen, 1968), which is
more suitable for ordinal categories given that it weights disagree- 
ments differently depending on the distance between the categories 
(we used linear weightings). As a complementary metric for this 
task, we computed the Spearman’s rank correlation coefficient 
(Spearman, 1904). In a number of classification problems, like 
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Interrater Agreement and Correlation Coefficients on the NAEP Scale 


Scorer # recordings p(a) 
Human 1 HAL 0.87 
Human 2 391 0.90 
Human 3 698 0.87 
Human 4 799 0.86 
Human 5 367 0.86 
FLORA 1,776 0.94 


NAEP-2 NAEP-4 

ple) K K p 

0.50 0.73 0.66 0.80 
0.50 0.80 0.69 0.81 
0.50 0.74 0.68 0.81 
0.50 0.71 0.69 0.81 
0.50 0.71 0.68 0.80 
0.50 0.84 0.77 0.86 





Note. NAEP = National Assessment of Educational Progress; FLORA = FLuent Oral Reading Assessment. 


emotion classification, the data are annotated by a group of human 
raters who may exhibit consistent disagreements on similar classes 
or similar attributes. In such classification tasks, it is inappropriate 
to assume that there is only one correct label because different 
individuals may consistently provide different annotations (Steidl, 
Levit, Batliner, N6th, & Niemann, 2005). Although the NAEP 
scale is based on clear descriptions of reading behaviors at each of 
four levels, children’s reading behaviors can vary across these 
descriptions while reading, and individuals scoring the stories may 
differ consistently in how they interpret and weight children’s oral 
reading behaviors. For this reason, we believe that examining 
correlations between human raters and between human raters and 
the machine classifiers is a meaningful and useful metric for this 
task. 

Each row in Table 2 shows the agreement and correlation 
coefficients of each rater with respect to the other raters (excluding 
FLORA in the case of the human raters; note that not all the 
scorers scored the same number of recordings). In order to inter- 
pret the computed kappa values, we have used as a reference the 
interpretation of the kappa coefficient provided in Landis and 
Koch (1977), which attributes good agreement to kappa values 
within the interval (0.61–0.80) and very good agreement to higher 
kappa values (0.81–1.00). According to this interpretation, Table 2 
reveals that (a) there is good interhuman agreement for both the 
NAEP-2 and NAEP-4 tasks, (b) there is good FLORA-to-human 
agreement for the NAEP-4 task, and (c) there is very good 
FLORA-to-human agreement for the NAEP-2 task. It can be 
observed that the kappa agreement between FLORA and the 
humans is higher than the agreement between each human scorer 
and the rest of the human scorers. This is true for both the NAEP-2 
and NAEP-4 tasks. This difference in agreement is statistically 
significant, which indicates the ability of the proposed features and 
classification scheme to provide a useful method to automatically 
assess expressive oral reading using the NAEP scale. 

In terms of Spearman’s rank correlation coefficient (ρ), we 
obtained relatively strong interhuman correlation (.80–.81) and an 
even stronger machine-to-human correlation (.86) in the NAEP-4 
task. This indicates that NAEP scores from every pair of scorers 
are closely related, which is consistent with the weighted kappa 
values obtained. 
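
For readers who want to reproduce these two ordinal metrics, the sketch below shows one way to compute a linearly weighted kappa and Spearman's rank correlation (with average ranks for ties) in plain Python. It is an illustration under stated assumptions rather than the scoring code used in the study: the function names are hypothetical, the example ratings are invented, and the helper `agreement_band` simply encodes the two kappa intervals quoted in the text.

```python
def linear_weighted_kappa(a, b, levels=(1, 2, 3, 4)):
    """Weighted kappa with linear disagreement weights |i - j| / (k - 1)."""
    k, n = len(levels), len(a)
    idx = {lv: i for i, lv in enumerate(levels)}
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1 / n           # observed joint proportions
    row = [sum(r) for r in obs]                # marginals of rater a
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            num += w * obs[i][j]               # observed weighted disagreement
            den += w * row[i] * col[j]         # chance weighted disagreement
    return 1.0 if den == 0 else 1 - num / den


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(a, b):
    """Pearson correlation of the average ranks of a and b."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("rank correlation is undefined for constant ratings")
    return cov / (var_a * var_b) ** 0.5


def agreement_band(kappa):
    """Label kappa using only the two intervals quoted in the text."""
    if kappa >= 0.81:
        return "very good"
    if kappa >= 0.61:
        return "good"
    return "below good"


# Invented NAEP-4 ratings for eight recordings (illustration only).
human = [1, 2, 2, 3, 3, 3, 4, 4]
machine = [1, 2, 3, 3, 3, 4, 4, 4]
kw = linear_weighted_kappa(human, machine)
rho = spearman_rho(human, machine)
print(f"weighted kappa={kw:.3f} ({agreement_band(kw)}), rho={rho:.3f}")
```

Because the weights grow with the distance between categories, the two adjacent-level disagreements in the example are penalized only lightly, which is the property that makes the weighted coefficient suitable for the ordinal NAEP-4 scale.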

In Table 3, we display cross-tabs of agreement and disagreement 
between humans and between FLORA and humans (in percent- 
ages). In both cases, most of the data lie in the main diagonal, and 
we believe that there are no obvious biases between humans and 
FLORA. 


Connection Between Reading Accuracy, Reading Rate, 
and Expressive Reading 


We conducted a set of analyses to gain insight into the 
relationship between the two main measures of ORF: WCPM and 
expressiveness. These analyses are displayed in Figures 2a and 2b. 
In each panel of the figure, we sorted students according to their 
WCPM percentile using the Hasbrouck and Tindal (2006) norms. 
Thus, the leftmost bar of each panel represents students with 
WCPM scores below the 10th percentile, whereas the rightmost 
bar shows students in the 90th percentile. Figure 2a displays 
percentile assignments based on the average of the human scorers’ 
ratings, and Figure 2b displays percentile assignments based on FLORA 
WCPM estimates. The tones of gray within each bar indicate the 
percentage of students at each NAEP score; in Figure 2a, these 
percentages are based on the NAEP scores assigned by the human 
scorers, and in Figure 2b they are based on the scores assigned by FLORA. 


Table 3 
Cross-Tabs of Agreement/Disagreement Between FLORA and Human-Generated NAEP Scores (in %) 

                       FLORA                               Human 
Human        1       2       3       4          1       2       3            4 
  1        16.6     2.9     0.1     0          14.9     4.2     0.1          0 
  2         3.5    21.3     3.9     0.2         4.4    19.6     5.7          0.2 
  3         0       3.9    32.4     5.6         0.1     7.2     [illegible]  5.2 
  4         0       0       3       6.6         0       0       4.7          6.1 

Note. FLORA = FLuent Oral Reading Assessment; NAEP = National Assessment of Educational Progress. 


Figure 2. a: Distribution of recordings across the NAEP scale for each WCPM percentile according to human 
scorers. b: Distribution of recordings across the NAEP scale for each WCPM percentile according to FLORA. 
NAEP = National Assessment of Educational Progress; WCPM = words correct per minute. 


It is clear from this figure that recordings in the highest percen- 
tiles (highest reading accuracy and rate) correspond to more ex- 
pressive readers (higher levels in the NAEP scale). For example, 
all of the recordings for students in the 90th percentile based on 
WCPM were assigned to Levels 3 and 4 in the NAEP scale. 
Moreover, about 97.0% of the recordings below the 10th percentile 
were assigned to Levels 1 and 2 in the NAEP scale. Figures 2a and 
2b reveal several interesting patterns: A significant percentage of 
recordings placed below the 50th percentile (which might be used 
to identify students in need of fluency support) were placed in the 
higher levels of the NAEP scale according to our expert human 
annotators (3.08%, 24.02%, and 45.19% for recordings below the 
10th percentile, in the 10th percentile, and in the 25th percentile, 
respectively). This means that there are a number of speakers who, 
despite reading below the expected rate according to the percen- 
tiles published by Hasbrouck and Tindal (2006), read with appro- 
priate/good expression and would be considered fluent readers 
according to the NAEP scale. Another interesting observation is 
that a significant percentage of recordings placed above the 50th 
percentile were assigned to the lower levels in the NAEP scale by 
our expert human annotators. Those recordings likely correspond 
to speakers who are reading for speed rather than for comprehen- 
sion in order to get as many words read as possible within the 
1-min session. In particular, 24.88% of the recordings in the 50th 
percentile were assigned to Levels 1 and 2 in the NAEP scale 
(nonfluent), whereas 13.92% of the recordings in the 75th percen- 
tile were assigned to those levels. We note that the instructions 
provided to students before recording stories emphasized the im- 
portance of reading the text naturally, rather than as fast as they 
could; these percentages might have been higher if we had not 
emphasized reading naturally in the instructions. These observations 
suggest that measuring both expressiveness and WCPM is 
likely to be both informative and beneficial to understanding 
individual students’ oral reading abilities. Finally, we note that 
Figure 2b, which is analogous to Figure 2a but was built using 
FLORA scores, presents very similar information. 
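
The kind of breakdown shown in Figure 2 can be sketched as a simple cross-tabulation of WCPM percentile band against NAEP level. The code below is an illustrative sketch only: the band labels, the function name, and the twelve records are invented for the example and do not reproduce the study data.

```python
from collections import Counter, defaultdict


def naep_distribution_by_band(records):
    """Map each WCPM percentile band to {NAEP level: % of recordings}."""
    counts = defaultdict(Counter)
    for band, level in records:
        counts[band][level] += 1
    table = {}
    for band, level_counts in counts.items():
        total = sum(level_counts.values())
        table[band] = {lv: 100.0 * level_counts[lv] / total for lv in (1, 2, 3, 4)}
    return table


# Invented (percentile band, NAEP level) pairs, one per recording.
records = [
    ("<10th", 1), ("<10th", 2), ("<10th", 1), ("<10th", 3),
    ("50th", 2), ("50th", 3), ("50th", 3), ("50th", 4),
    ("90th", 3), ("90th", 4), ("90th", 4), ("90th", 4),
]
dist = naep_distribution_by_band(records)
# Levels 3 and 4 correspond to "fluent" on the binary NAEP-2 scale.
fluent_share = {band: row[3] + row[4] for band, row in dist.items()}
for band, share in fluent_share.items():
    print(f"{band:>6}: {share:.1f}% of recordings at NAEP Levels 3-4")
```

Reading the fluent share per band side by side makes the two patterns discussed above easy to spot: expressive readers in low WCPM bands and nonfluent readers in high ones.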


Discussion and Conclusions 


We investigated the automatic assessment of ORF in children’s 
speech according to two standard rubrics: WCPM (to measure 
accuracy and rate) and the NAEP Expressiveness scale. Compared 
with human scoring of WCPM and expressiveness on 783 one-min 
recordings of children reading grade-level text passages, results 
show that automatically generated WCPM scores differ by an 
average of 3.5 words with respect to the human-average score for 
each recorded story, whereas humans differ by an average of 1.5 
words for each story. 

For expressiveness, FLORA had an accuracy of 90.93% classi- 
fying recordings according to the binary NAEP scale (“fluent” vs. 
“nonfluent”) and 76.05% on the more difficult 4-point NAEP 
scale. According to the classification of kappa strength proposed 
by Landis and Koch (1977), the kappa agreement for both NAEP-2 
and NAEP-4 tasks between each human scorer and the rest of the 
human scorers was good, whereas the kappa agreement between 
the machine and the human scorers was very good and good, 
respectively. In addition, the kappa agreement between FLORA 
and each human scorer was always significantly higher than the 
kappa agreement between the human scorers. In terms of the 
Spearman’s rank correlation coefficient (ρ), correlation between 
the machine and each human scorer was always significantly 
higher than the correlation between human scorers. 

The results of the research reveal that speech recognition and 
machine learning systems can produce accurate assessments of 
WCPM and expressiveness that approach (WCPM) or exceed 
human performance. Without question, the results of the WCPM 
scores reported above can be improved substantially in the near 
future using known ASR solutions, such as collecting more train- 
ing data to model children’s speech patterns. For example, Ver- 
gyri, Lamel, and Gauvain (2010) reported that accent-dependent 
acoustic modeling (which implies training/adapting on data from 
the target accent) produces a significant increase in recognition 
performance compared with accent-independent modeling. In a 
recent study that we conducted on 191 native Spanish children 
learning to read English text in Spanish schools (Bolaños, Elhazaz, 
Ward, & Cole, 2012), we determined experimentally that statistical 
models trained on speech from the target population were signif- 
icantly more accurate than models trained on native English chil- 
dren. Results from that study showed a mean difference in WCPM 
scores of 5.49 and 4.96, respectively, between FLORA and each of 
the human scorers, whereas the mean difference between the 
human scorers was about 5.92 words. 

Perhaps the major limitation of this study is the relatively small 
number of students (313) used in our research. To fully demon- 
strate the feasibility and validity of a fully automatic assessment of 
ORF, speech data during oral reading of leveled texts must be 
collected for a large and diverse population of students at different 
grade levels, representing students with different dialects and 
accents. The system must also be tested with data collected from 
many different classrooms or computer labs to model the acoustic 
environments and the realities of real-world use. 


Toward Valid Automatic Assessment of ORF 


We believe there are great potential benefits of incorporating 
measures of expressiveness into assessments of ORF. One of the 
major criticisms of using WCPM to measure individual students’ 
improvements in reading over time (i.e., in response to instruction) 
is that students strive to read texts as quickly as possible in order 
to increase their WCPM scores, which teachers often set as learn- 
ing targets within a reading instruction program. When a student’s 
ability is measured in terms of how quickly he or she can read the 
words in a text, teachers and students learn to focus on reading 
fast, rather than reading the text at a normal reading rate with 
intonation and phrasing that communicates the meaning of text, 
and thus reflects its comprehension by the student. Fast readers 
have shorter segment durations, muted stress marking, and reduced 
phrase-final bracketing than slow readers, so the normal compre- 
hension benefits children might experience by reading with good 
prosody may not be derived by students who are trying to read fast 
(Benjamin & Schwanenflugel, 2010; Kuhn et al., 2010). In sum, 
the emphasis on speed that can result from using WCPM as the 
only measure of ORF may undermine the goal of helping students 
develop strategies for reading with deep understanding. 

Incorporating measures of expressiveness into assessments of 
ORF could mitigate this problem. One can easily imagine a 
weighted measure of ORF that combines WCPM and expressive- 
ness estimates, such that students receive the highest score when 
the words in a text are read at a natural speaking rate with prosody 
appropriate to the discourse structure of the text. In fact, some 
rating systems of reading expressiveness such as the Multidimensional 
Fluency Guide (Rasinski, Rikli, & Johnston, 2009) already 
do this. 

One major benefit of the automated scoring of reading 
prosody by FLORA, which neither the NAEP nor the various other 
teacher rating systems for evaluating reading fluency can claim, is 
its grounding in research on reading prosody: these reading fluency 
scales have not (as yet) been grounded in such research. We do not 
know whether the ratings 
obtained using these scales would be spectrographically valid, that 
is, that children rated as expressive on these scales would be the 
same ones who would appear expressive when their readings are 
viewed on a spectrogram. Because the features used in FLORA to 
classify expressive reading were derived directly from spectro- 
graphic measures derived from children’s speech (Kuhn et al., 
2010), FLORA can make this claim. Conversely, because the 
teacher NAEP ratings match the spectrographic distinctions made 
by FLORA, FLORA has also served to validate teacher impres- 
sions of reading prosody as determined by the NAEP. In sum, fully 
automatic assessment of ORF that combines its three components 
appears to be feasible with today’s technologies. Additional re- 
search is needed to determine how to use these measures to 
provide the most useful feedback to teachers and students to assess 
students’ reading abilities and inform instruction. 


This article describes an advanced learning technology used to investigate hypotheses about learning by 
teaching. The proposed technology is an instance of a teachable agent, called SimStudent, that learns 
skills (e.g., for solving linear equations) from examples and from feedback on performance. SimStudent 
has been integrated into an online, gamelike environment in which students act as “tutors” and can 
interactively teach SimStudent by providing it with examples and feedback. We conducted 3 classroom 
“in vivo” studies to better understand how and when students learn (or fail to learn) by teaching. One of 
the strengths of interactive technologies is their ability to collect detailed process data on the nature and 
timing of student activities. The primary purpose of this article is to provide an in-depth analysis across 
3 studies to understand the underlying cognitive and social factors that contribute to tutor learning by 
making connections between outcome and process data. The results show several key cognitive and 
social factors that are correlated with tutor learning. The accuracy of students’ responses (i.e., feedback 
and hints), the quality of students’ explanations during tutoring, and the appropriateness of tutoring 
strategy (i.e., problem selection) all positively affected SimStudent’s learning, which further positively 
affected students’ learning. The results suggest that implementing adaptive help for students on how to 
tutor and solve problems is a crucial component for successful learning by teaching. 


Keywords: learning by teaching, machine learning, SimStudent, teachable agent, tutor learning 


It has been widely observed that students learn by teaching 
others (e.g., E. G. Cohen, 1994). Such an effect of learning by 
teaching (also known as the tutor-learning effect) has been empir- 
ically confirmed in many different domains for many different 
structures of peer tutoring with different ages and achievement 
levels (Roscoe & Chi, 2007). Despite a long-standing history of 
empirical studies on the tutor-learning effect, not enough is known 
about the cognitive and social theory of when, how, and why tutors 
learn (or fail to learn) by teaching. 

A primary challenge in theory development for tutor learning is 
a lack of process data, that is, a detailed record of interactions 
between tutors and tutees. Collecting rich process data from peer 
tutoring sessions can enable descriptions of tutoring activities at a 
fine level of granularity, such as dialogue between the tutor and the 
tutee, response accuracy, and timing and sequencing of actions. 
When combined with outcome data (e.g., test scores), this detailed 
information can allow further exploration of elements of cognitive 
and social theory of tutor learning. However, such process data are 
rarely available. Roscoe and Chi (2007) reported that only six out 
of thousands of related articles report both outcome and process 
data. An obvious reason for the lack of process data is the diffi- 
culty in collecting such data during a study in which human 
students tutor their peers. 

In their meta-analysis of prior research, Roscoe and Chi (2007) 
summarized potential flaws in program design and implementation 
that might have impacted the tutor-learning effect. One way to 
avoid such flaws is to better understand the process of tutor 
learning and to provide appropriate facilities for the tutors. Knowl- 
edge gained from combined process and outcome data can aid 
iterative design engineering of more effective learning by tutoring. 

To help advance the cognitive and social theory of tutor learn- 
ing, we have developed a synthetic pedagogical agent as a tutee 
that students can interactively tutor. Such a pedagogical agent is 
often called a teachable agent, which in our case is named Sim- 
Student (Matsuda, Cohen, Sewall, Lacerda, & Koedinger, 2007). 
SimStudent engages in genuine machine learning to learn procedural 
problem-solving skills. SimStudent has been integrated into 
an online, gamelike environment called APLUS (Artificial Peer 
Learning environment Using SimStudent). 

With APLUS and SimStudent, we have conducted three tightly 
controlled in vivo studies for middle-school students learning 
algebra linear equations (Matsuda, Cohen, et al., 2012; Matsuda et 
al., 2011; Matsuda, Yarzebinski, et al., 2012). Solving linear 
equations is a critical area in the early algebra curriculum, yet 
many secondary school students experience great difficulty mak- 
ing the transition from arithmetic to algebra, especially in learning 
how to solve equations (see, e.g., Bednarz & Janvier, 1996; Filloy 
& Rojano, 1989; Kieran, 1992; Linchevski & Herscovics, 1996). 
Developing an effective intervention for learning equation solving thus 
addresses an urgent practical need as well. 

In this article, we investigated the following research questions: 

Question 1: Does SimStudent actually learn how to solve equa- 
tions when tutored by students in an authentic classroom setting? 
Accordingly, do students learn by teaching SimStudent? 

Question 2: How do tutor and tutee learning correlate with each 
other? 

Question 3: When and how do students learn or fail to learn by 
teaching SimStudent? 

To answer these questions, we conducted in-depth analyses 
across three in vivo studies to understand the underlying cognitive 
and social factors that contribute to tutor learning. These analyses 
benefited from both outcome and process data. 

In the rest of the article, we first provide a survey of prior 
research on the tutor-learning effect and the teachable agent tech- 
nology. We then introduce SimStudent and APLUS, with a tech- 
nical overview of how SimStudent acts as a teachable agent. Next, 
we explain how students interactively tutor SimStudent and pro- 
vide an overview of the data analysis, which includes empirical 
data collected from the three in vivo studies. We then discuss how 
and when students learn or fail to learn by teaching SimStudent 
based on the process and outcome data. We conclude with a 
discussion of directions for future research based on the lessons 
learned from our studies. 


The Tutor-Learning Effect 


The tutor-learning effect has been studied for many years (Chi, 
Siler, Jeong, Yamauchi, & Hausmann, 2001; P. A. Cohen, Kulik, 
& Kulik, 1982; Devin-Sheehan, Feldman, & Allen, 1976; Gartner, 
Kohler, & Riessman, 1971; Graesser, Person, & Magliano, 1995) 
and for different age groups, varying from elementary (Sharpley, 
Irvine, & Sharpley, 1983) to middle school (Jacobson et al., 2001; 
King, Staffieri, & Adelgais, 1998) to college (Annis, 1983; Top- 
ping, 1996). It has also been observed in various subject domains, 
including mathematics, reading, science, and social studies (P. A. 
Cohen et al., 1982; Cook, Scruggs, Mastropieri, & Casto, 1986; 
Mastropieri, Spencer, Scruggs, & Talbott, 2000; Mathes & Fuchs, 
1994; Rohrbeck, Ginsburg-Block, Fantuzzo, & Miller, 2003), and 
in different forms of tutoring, including reciprocal tutoring (Pal- 
incsar & Brown, 1984), collaborative passage learning (Bargh & 
Schul, 1980), and small-group learning as opposed to peer-to-peer 
learning (Webb & Mastergeorge, 2003). It has also been demon- 
strated that tutors can learn by just preparing for teaching (Biswas 
et al., 2001). 




Learning by teaching has been shown to be effective for minor- 
ity populations. Robinson, Schofield, and Steers-Wentzell (2005) 
found that African American student tutors learned more from 
math peer tutoring than White students. Rohrbeck et al. (2003) 
found a larger effect size in groups with more than 50% minority 
enrollment than groups with lower minority enrollment. Other 
researchers found positive outcomes for students from underpriv- 
ileged backgrounds (Greenwood, Delquadri, & Hall, 1989; Jacob- 
son et al., 2001) and students with learning disabilities (Cook et al., 
1986; Mastropieri et al., 2000). 

Despite the fact that many experimental studies support the 
tutor-learning effect, the actual effect size has been known to be 
rather moderate (P. A. Cohen et al., 1982; Cook et al., 1986; 
Mastropieri et al., 2000; Mathes & Fuchs, 1994; Rohrbeck et al., 
2003). The tutor-learning effect has been shown to be relatively 
more effective in math than reading. For example, P. A. Cohen et 
al. (1982) showed an effect size of .62 for math and .21 for 
reading, and Cook et al. (1986) showed an effect size of .67 for 
math and .30 for reading. 

In sum, learning by teaching has the potential to be a successful 
intervention for a wide variety of student populations across many 
disciplines. It also has the potential to narrow the achievement 
gap across student demographic groups. Despite the popularity 
of the tutor-learning effect, we lack an adequate cognitive 
theory of tutor learning. Understanding the underlying cognitive 
principles of tutor learning could facilitate the development of 
effective learning technologies and may improve on the rather 
small effect size of tutor learning. 


Teachable Agent 


There are a number of advantages of using a teachable agent 
technology to study the tutor-learning effect (e.g., VanLehn, 
Ohlsson, & Nason, 1994). First, it enables implementation of tight, 
precisely determined control conditions. For example, the variance 
of tutees can be controlled by having students teach the same 
version of the teachable agent. The teachable agent technology 
also allows researchers to control the competency of the tutee to 
see how it may affect tutor learning. Second, the teachable agent 
allows researchers to conduct peer-tutoring studies without the risk 
of harming tutees. Although nonexpert tutors have a greater chance 
of teaching inaccurate knowledge, previous studies showed that 
tutors often learned at the cost of tutee errors. Walker, Rummel, 
and Koedinger (2009) found that the amount of tutee errors had a 
significant positive correlation with tutor learning, whereas it had 
a significant negative correlation with tutee learning. Third, the 
teachable agent technology facilitates the collection of detailed 
process data showing interactions between the student and the 
agent, which is a major contribution of the current article. 

There have been three major techniques used to build teachable 
agents: (a) Some teachable agents (TAs) solve problems using the 
shared knowledge that students create. Students using such 
knowledge-sharing TAs are often told that they teach the agent by 
directly providing the shared knowledge to the agent. For example, 
students teach Betty’s Brain by drawing a concept map represent- 
ing causal relationships between factors related to river ecology 
(Biswas, Leelawong, Schwartz, Vye, & The Teachable Agents 
Group at Vanderbilt, 2005; Leelawong & Biswas, 2008). (b) 
Another type of TA applies the knowledge-tracing technique 
that Cognitive Tutors use to diagnose students’ competency (Ritter, 
Anderson, Koedinger, & Corbett, 2007). Such knowledge- 
tracing TAs are equipped with a set of skills to be learned. Some 
of the skills are set to be inactive at the beginning to provide the 
agent with limited competency to solve problems. As the student 
tutors the agent, the model-tracer identifies the skill that was 
tutored and activates the tutored skill so that the agent can apply it 
to future problems. Pareto, Arvemo, Dahl, Haake, and Gulz (2011) 
developed a knowledge-tracing TA for students to learn arithmetic 
concepts. (c) The last type of TA integrates machine-learning 
engines that allow the TA to learn skills dynamically, arguably 
more accurately reflecting the tutor–tutee interaction. As an example 
of such a knowledge-learning TA, Michie, Paterson, and 
Hayes (1989) developed the Math Concept Learning System with 
an inductive logic programming engine (called ID3) developed by 
Quinlan (1986) to induce rules from examples, which enabled it to 
learn math skills and solve equations. STEP (Simulated, Tutorable 
Physics Student) is another example of the knowledge-learning TA 
in Physics (Ur & VanLehn, 1995). 

SimStudent is an example of a knowledge-learning TA, but 
has several distinctive characteristics compared with other TAs. 
First, SimStudent is one of a few TAs that have been intensively 
used in authentic classroom settings. Other such empirically 
well-validated agents include Betty’s Brain (Biswas, Jeong, 
Kinnebrew, Sulcer, & Roscoe, 2010) and the TA developed by 
Pareto et al. (2011). Second, in contrast to other TAs, which 
have been largely implemented in declarative domains, Sim- 
Student learns algebra content with a focus on procedural 
problem solving. Third, SimStudent is an instance of a TA with 
a humanlike learning capability (Li, Matsuda, Cohen, & Koed- 
inger, 2011; Ohlsson, 2008). SimStudent performs inductive 
learning to interactively generalize examples provided by the 
student. Therefore, a naturalistic tutoring dialogue can occur 
between the student and SimStudent. Fourth, because SimStu- 
dent inductively learns skills from examples, it may learn skills 
incorrectly, depending on the prior knowledge it is given and 
the way the student tutors SimStudent. One such common 
source of incorrect learning stems from ambiguities in exam- 
ples. To the best of our knowledge, SimStudent is the first TA 
that models students’ incorrect learning. Because students gen- 
erally learn both from correct and incorrect examples (Booth & 
Koedinger, 2008), observing a TA learning incorrectly may 
positively impact tutor learning. 
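
To make the point about ambiguous examples concrete, the toy sketch below shows how a single demonstrated step can be consistent with more than one candidate rule, and how a second example removes the ambiguity. This is not SimStudent's learning algorithm: the two hand-written hypotheses, the (a, b, c) encoding of a*x + b = c, and all names are assumptions made purely for illustration.

```python
def subtract_magnitude(a, b, c):
    """Hypothesis 1: subtract the number written next to the variable term."""
    return ("subtract", abs(b))


def undo_constant(a, b, c):
    """Hypothesis 2: cancel the constant term, taking its sign into account."""
    return ("subtract", b) if b > 0 else ("add", -b)


HYPOTHESES = {
    "subtract_magnitude": subtract_magnitude,
    "undo_constant": undo_constant,
}


def consistent(examples):
    """Names of the hypotheses that reproduce every demonstrated step."""
    return [
        name
        for name, rule in HYPOTHESES.items()
        if all(rule(*equation) == step for equation, step in examples)
    ]


# One demonstration: for 2x + 5 = 9 the tutor shows "subtract 5".
demos = [((2, 5, 9), ("subtract", 5))]
after_first = consistent(demos)      # both hypotheses still fit

# A second demonstration: for 3x - 7 = 11 the tutor shows "add 7".
demos.append(((3, -7, 11), ("add", 7)))
after_second = consistent(demos)     # only the sign-aware rule fits

print(after_first)
print(after_second)
```

A learner that happened to adopt the first hypothesis after one example would propose "subtract 7" for 3x - 7 = 11, which is the kind of incorrect generalization a tutor then has to correct.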


Overview of the Data Analysis 


To connect the outcome and process data to advance cognitive 
and social theories of the tutor-learning effect, data from three in 
vivo classroom studies have been analyzed to address the three 
research questions mentioned in the introduction. To measure 
SimStudent’s learning, we used the process data showing how well 
SimStudent performed on the quiz. To measure students’ learning 
(i.e., tutor learning), we used test scores as the outcome data. The 
correlation between SimStudent’s and students’ learning was analyzed 
using these two variables as well. We focused on a number 
of factors in the process data to analyze how and when SimStudent’s 
and students’ learning play out. 


The Learning Environment: APLUS and SimStudent 


Figure 1 shows an example screenshot of APLUS with SimStu- 
dent. The SimStudent avatar is visualized in the lower left corner. 
There have been three versions of SimStudent developed with 
different avatar images, as shown in Figure 2. Different versions of 
SimStudent have different functionalities to address different re- 
search questions as described later. 

The initial version of SimStudent is called Lucy and is repre- 
sented as a single static image (see Figure 2 i). The second version 
of SimStudent is called Stacy (see Figure 2 ii) and is capable of 
three facial expressions, including a thinking pose when SimStu- 
dent commits to learning, a happy expression when a problem is 
solved, and a neutral expression otherwise. The third version of 
SimStudent is called Tomodachi (see Figure 2 iii). Students can 
customize Tomodachi’s avatar by changing the name, hairstyle, 
skin color, eyes, and shirt. Tomodachi is capable of the same three 
facial expressions as Stacy. 


Overview of Tutoring Interaction 


In APLUS, a student interactively tutors SimStudent with the 
following tutoring actions: 

• Pose a problem for SimStudent to solve in the Tutoring 
Interface. In Figure 1a, the student entered “3x - 7” and “11” in the 
first row of the equation table. SimStudent then attempts to solve 
the problem by applying learned productions and asking the student 
about the correctness of each step. 

• Provide flagged (yes/no) feedback to SimStudent that shows 
the student’s judgment on the correctness of SimStudent’s steps. 
When the student provides negative feedback, SimStudent may 
make another attempt. In Figure 1a, SimStudent entered “18” in 
the second row, and asked whether the student thought it was a 
good move. The student then provided positive feedback. 

• Provide help on what to do next. When SimStudent does not 
know what to do, SimStudent asks the student for help. To respond 
to the help request, the student demonstrates the next step in the 
tutoring interface. In Figure 1a, SimStudent got stuck after entering 
“18.” In response, the student tutors SimStudent by showing it 
a possible next step, in this case entering “divide 3” for the 
transformation of the second row. 

• Quiz SimStudent to gauge learning. Students may have SimStudent 
take (and retake) the quiz at any time during tutoring (see 
Figure 1c). Further details of the quiz are below. 
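
The equation steps in this walkthrough can be traced with a small sketch that applies one transformation to both sides of an equation a*x + b = c. The (a, b, c) encoding and the function names are assumptions made for illustration and do not describe APLUS internals; the first step ("add 7") is inferred from the move from “3x - 7” and “11” to “3x” and “18.”

```python
from fractions import Fraction


def apply_step(equation, op, value):
    """Apply one transformation to both sides of a*x + b = c."""
    a, b, c = (Fraction(v) for v in equation)
    value = Fraction(value)
    if op == "add":
        return (a, b + value, c + value)
    if op == "subtract":
        return (a, b - value, c - value)
    if op == "multiply":
        return (a * value, b * value, c * value)
    if op == "divide":
        if value == 0:
            raise ZeroDivisionError("cannot divide both sides by zero")
        return (a / value, b / value, c / value)
    raise ValueError(f"unknown transformation: {op}")


def is_solved(equation):
    """True when the equation has the form x = c."""
    a, b, _ = equation
    return a == 1 and b == 0


# 3x - 7 = 11  ->  add 7  ->  3x = 18  ->  divide 3  ->  x = 6
equation = (3, -7, 11)
equation = apply_step(equation, "add", 7)
after_add = equation
equation = apply_step(equation, "divide", 3)
print(after_add, equation, is_solved(equation))
```

Using exact fractions keeps results such as q = 1/2 (from 4q = 2, divide 4) free of floating-point rounding.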

There are also resources for students to review learning objec- 
tives in the unit overview and to review problem-solving proce- 
dures by studying worked-out examples. Clicking the different 
[Example] tabs displays complete examples in the Tutoring Inter- 
face. The [Unit Overview] tab provides a brief overview of the 
target unit (i.e., equations with variables on both sides), a model 
solution with elaborated explanations, and suggested problems for 
students to use when tutoring SimStudent. 


Quiz 


In the classroom studies, students were told that their goal was 
to tutor SimStudent well enough so that SimStudent would pass a 
predefined quiz. The quiz has four sections each with two equation 
problems. There is a one-step equation (e.g., 3x = 6), three 


[Figure 1 panels: (a) the Tutoring Interface, with the equation table and the chat dialogue between Stacy and the student; (b) SimStudent asking why a step she performed was incorrect, with the student answering SimStudent’s question by typing free text in a chat box; (c) summary of the quiz results.] 


Figure 1. A screenshot of the Study II APLUS, the online gamelike learning environment in which the student 
can interactively tutor SimStudent. Students enter a problem on the first row of the Equation column. SimStudent 
attempts to solve the problem by entering steps (e.g., “3x” and “18” in this case). SimStudent asks the student 
about the correctness of the steps. When SimStudent cannot perform a step correctly, it asks the student for help. 
In this example, the student entered “divide 3” in the second row as a next step after SimStudent entered “18.” 
SimStudent occasionally asks questions (b). In this example, SimStudent is asking for a reason why a step it 
performed is considered to be wrong. Students may have SimStudent take the quiz by clicking on the [Quiz 
Stacy] button. After SimStudent takes the quiz, a summary dialog window is shown (c). APLUS = Artificial 


Peer Learning environment Using SimStudent. 


two-step equations (e.g., —2x + 5 = 11), and four equations with 
variables on both sides (e.g., 3 — 2x = 5x + 7). SimStudent takes 
the quiz section by section, and must correctly solve both problems 
in each section to proceed to the next section. 

After SimStudent takes the quiz, the overall results and correct- 
ness of the steps are displayed in a different window, as shown in 
Figure 1c. An embedded Cognitive Tutor Algebra program (Ritter 
et al., 2007) grades the quiz results. The Cognitive Tutor is 
invisible to students. 
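
The section-by-section rule can be made concrete with a short sketch. It assumes only what is stated above (four sections, two problems each, and both problems required to advance); the function name `grade_quiz` and the input layout are hypothetical and are not taken from APLUS or the Cognitive Tutor.

```python
from typing import Sequence, Tuple


def grade_quiz(section_results: Sequence[Sequence[bool]]) -> Tuple[int, int, int]:
    """Return (sections_passed, problems_correct, problems_total).

    section_results[i][j] is True when problem j of section i was solved
    correctly. Sections after the first failed one are never attempted, so
    their entries are ignored, but their problems still count toward the total.
    """
    sections_passed = 0
    problems_correct = 0
    for results in section_results:
        problems_correct += sum(results)
        if not all(results):
            break  # both problems must be correct to reach the next section
        sections_passed += 1
    problems_total = sum(len(results) for results in section_results)
    return sections_passed, problems_correct, problems_total


# Usage: one of two problems correct in Section 1, so the quiz stops there.
summary = grade_quiz([[True, False], [True, True], [True, True], [True, True]])
```

With this input the sketch reports no sections passed and one of eight problems correct, which is the situation in which a summary would read 1/2 for the section and 1/8 overall.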


The quiz problems were randomly ordered for Study I, but
they were ordered on the basis of increasing difficulty level for
Studies II and III. The quiz problems were fixed for Studies I
and II; that is, SimStudent was given the same set of quiz
problems each time the student administered a quiz. For Study
III, the quiz problems were generated on the fly while keeping
the type of problems intact. This means that although the
numbers and variable letters were changed each time SimStudent
took the quiz, the positive and negative signs were preserved.
For example, [first equation illegible in the source] and
−2y […] 4 = 10 are considered to be isomorphic equations.

[Figure 2 image not reproduced. Rows: Study I (Lucy), Study II (Stacy), Study III (Tomodachi); columns: Neutral, Thinking, Happy; the Thinking and Happy cells for Study I are N/A.]

Figure 2. SimStudent’s avatar image used in the three studies. There was only one static image used for Study
I (Lucy). There are three facial expressions for Study II (Stacy): (a) neutral when waiting for the next problem
to be entered, (b) thinking, and (c) happy after solving a problem. Study III (Tomodachi), which can be
customized and uniquely named by each student, also has these three expressions. N/A = not applicable.


Self-Explanation


In Studies II and III, SimStudent had an ability to occasionally
ask questions about things that students did, and these questions
were intended to elicit students’ self-explanations (Matsuda, Cohen,
et al., 2012).

SimStudent’s questions appear in the chat box at the bottom of
the APLUS interface. SimStudent randomly selects a question
from among a set of two to three questions that are relevant to each
of three specific situations:

1. When the student inputs a new problem in the system,
SimStudent asks why the student selected that problem or what
that problem will help it learn.

2. If the student provides negative feedback on a step that
SimStudent performed, SimStudent may or may not ask a question.
If SimStudent has alternative actions to perform, it will not
ask for an explanation. In cases in which SimStudent does solicit
an explanation, it takes the last attempt that was made for the
particular skill and asks why the step was incorrect, or how the
situation is different from a previous step on which the same
operation was used correctly.

3. SimStudent also asks questions after the student has provided
a hint about transformation steps, not the results of the transformations
(as the latter involves arithmetic calculation, and is thus
often obvious). SimStudent will not ask a question at this point if
it already asked about the student’s negative feedback on the same
step.

The way students input their responses varies depending on the
type of question. For a question on a demonstrated hint or new
problem, there is a drop-down menu available with prewritten,
context-specific explanations. We hypothesized that such menu
items would work as examples for students to learn (cf. Aleven &
Koedinger, 2002). For questions about a demonstrated hint, the
menu items use terminology such as variable, constant, and coefficient
in a manner that reinforces their meanings. For questions
about a new problem, the menu items include the key target
concepts such as “It will help you learn how to deal with variables
on both sides.” Even when selecting an answer from the drop-down
menu, students can also edit the selected text with their own
words. For questions about negative feedback, for example, “Why
is (x) wrong?” students need to input their own answers. Figure 1b
shows an example of a student’s response to SimStudent’s question
about why “subtract 7” is wrong for the first transformation
(which, by the way, is an example of SimStudent making an error
that students commonly make).

SimStudent waits for student input before continuing to the next 
step of the equation. After the student clicks the submit button, the 
answer appears in the chat box below SimStudent’s question. This 
explanation is also logged, but the answer does not affect SimStu- 
dent’s learning. If the student clicks the submit button without 
providing an explanation, the student has essentially ignored Sim- 
Student’s question, and it will move on to the next step. In the 
classroom study, the students were not informed that they could 
skip the questions. 
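
The three questioning situations described in this section amount to a small decision rule, which can be sketched as follows. This is a hedged illustration rather than the APLUS code: the function `pick_question`, the event labels, and the wording of the question pools are all hypothetical; only the conditions under which no question is asked come from the text.

```python
import random
from typing import Optional

# Hypothetical question pools (two per situation); the real wording is not
# given in the article beyond the examples it quotes.
QUESTIONS = {
    "new_problem": [
        "Why did you choose this problem?",
        "What will this problem help me learn?",
    ],
    "negative_feedback": [
        "Why is this step wrong?",
        "How is this different from the earlier step where it worked?",
    ],
    "transformation_hint": [
        "Why should I do this step?",
        "Why is this a good transformation here?",
    ],
}


def pick_question(event, *, has_alternative=False, asked_on_this_step=False,
                  rng=random) -> Optional[str]:
    """Return a randomly chosen question for the event, or None to stay silent."""
    if event == "negative_feedback" and has_alternative:
        return None  # SimStudent simply tries its alternative action
    if event == "transformation_hint" and asked_on_this_step:
        return None  # it already asked about negative feedback on this step
    pool = QUESTIONS.get(event)
    if pool is None:
        return None  # e.g., hints about arithmetic results prompt no question
    return rng.choice(pool)


# Usage with a seeded generator so that the choice is reproducible.
question = pick_question("new_problem", rng=random.Random(0))
```

Passing a `random.Random` instance keeps the random selection testable without changing the rule itself.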


Overview of SimStudent’s Learning 


The underlying machine-learning paradigm used for SimStudent 
is a technique called programming by demonstration (Lau & Weld, 
1998) that generalizes positive and negative examples to generate 
a set of hypotheses using a given set of background knowledge 
sufficient to interpret (or “explain”) the examples. The positive and 
negative examples are provided by students as feedback and hints,
as described in the previous section. Affirmative feedback (i.e.,
“yes”) and hints become positive examples, whereas instances of 
negative feedback (i.e., “no”) become negative examples. 

SimStudent generalizes from these positive and negative exam- 
ples and generates a set of production rules that can reproduce all 
positive examples but no negative examples. Each production 
represents where to focus attention to know when and how to apply 
a particular skill. SimStudent uses hybrid AI techniques to learn 
the where, when, and how parts of a production rule. Providing 
technical details of the learning algorithm is beyond the scope of 
this article, but can be found elsewhere (Matsuda et al., 2007). 

As mentioned earlier, one of the unique characteristics of Sim- 
Student is its ability to learn skills incorrectly. We hypothesize that 
students learn incorrect skills by making inappropriate inductions 
from examples due to inappropriate background knowledge (Mat- 
suda, Lee, Cohen, & Koedinger, 2009). Such incomplete back- 
ground knowledge allows students to rely on shallow problem- 
solving features instead of deep domain principles. 

As an example, suppose that a student is about to generalize an 
example of “subtracting 3 from both sides of 2x + 3 = 5.” The 
student may recognize “+” in the left-hand side as the arithmetic 
operator instead of the sign of a term. As a consequence, the 
student may generalize this example to “subtract a number that 
follows an operator.” Students who perceive such a shallow feature
would also be likely to subtract 4 from both sides of 3x − 4 = 6,
which is one of the most frequently observed student
errors (Booth & Koedinger, 2008).
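
The contrast between the shallow and the deep generalization can be illustrated with two hand-written rules. This is a minimal sketch under stated assumptions and not SimStudent's learning algorithm: the regular expression, `shallow_rule`, and `deep_rule` are hypothetical, and only left-hand sides of the form "ax + b" or "ax - b" are handled.

```python
import re

# Matches a left-hand side such as "2x + 3" or "3x - 4" (ASCII minus sign).
LHS = re.compile(r"^\s*(-?\d*)x\s*([+-])\s*(\d+)\s*$")


def shallow_rule(lhs: str) -> str:
    """Perceptual rule: subtract the number that follows an operator."""
    _, _, number = LHS.match(lhs).groups()
    return f"subtract {number}"


def deep_rule(lhs: str) -> str:
    """Term-based rule: undo the signed constant term."""
    _, sign, number = LHS.match(lhs).groups()
    return f"subtract {number}" if sign == "+" else f"add {number}"


# Both rules reproduce the demonstrated example for 2x + 3 = 5 ...
same_on_example = shallow_rule("2x + 3") == deep_rule("2x + 3") == "subtract 3"
# ... but they diverge on 3x - 4 = 6, where the shallow rule errs.
shallow_step = shallow_rule("3x - 4")
deep_step = deep_rule("3x - 4")
```

Because both rules are consistent with the single positive example, only further feedback (for instance, a "no" on "subtract 4") can separate them, which is why the quality of the student's feedback matters.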

To model this type of incorrect learning, we “weakened” 
SimStudent’s background knowledge by dropping the concept 
of an algebraic term in an expression and adding more percep- 
tually grounded background knowledge, such as “get a number 
after an arithmetic operator.” In a prior study (Matsuda et al., 
2009), we validated the cognitive fidelity of SimStudent’s 
learning by comparing SimStudent’s and human students’ 
learning. The study showed that SimStudent with “weak” prior 
knowledge learned skills incorrectly in a humanlike manner and 
generated humanlike errors when solving problems using the 
learned productions. 

SimStudent applies learned productions to solve problems posed 
by a student, but the productions are not visible to the student. 
Therefore, the cognitive fidelity mentioned above could better 
facilitate tutor learning, because the student must identify, under- 
stand, and remediate SimStudent’s errors, which evoke or foster 
metacognitive tutoring skills and a deep understanding of the 
domain knowledge. 


Method 


Classroom Studies and Data Collection 


The three in vivo studies were conducted as controlled random- 
ized trials under the direct supervision of the Pittsburgh Science of 
Learning Center (LearnLab.org). Each study was conducted as a 
part of regular algebra classes. The studies used the same general 
format that involved 5 (Study I) or 6 (Studies II and III) days in the 
classroom. On the first day, all students took a pretest using an 
online test form (as described in the Measures section). After 
taking the pretest, students were randomly split into two groups 
and studied algebra equations using the assigned material for two
(Study I) or three (Studies II and III) class periods (one class
period per day). All students then took an online posttest on the
following day. Finally, all students took an online delayed test 2
weeks after the posttest.

Study I: Initial classroom trial. The primary goal of Study I 
was to evaluate the effectiveness of SimStudent (Matsuda et al., 
2011). The version of APLUS and SimStudent used in Study I 
behaved exactly as described in the previous section and is called 
Baseline hereafter. Algebra I Cognitive Tutor (Ritter et al., 2007) 
was used for the control condition. 

Study II: Self-explanation effect. In Study II, we focused on 
the self-explanation hypothesis, which conjectures that the tutor- 
learning effect is facilitated when the students are asked to explain 
and justify their tutoring decisions (Matsuda, Cohen, et al., 2012). 
To test this hypothesis, we compared SimStudent that did (the 
self-explanation condition) and did not (the baseline condition) ask 
questions. 

Study III: Game show effect. For Study III, we compared the 
effect of learning by teaching SimStudent in APLUS with and 
without a Game Show feature (Matsuda, Yarzebinski, et al., 2012). 
In the Game Show, a pair of SimStudents, each tutored by a 
different student, compete by solving problems posed by the 
students who tutored them. This study was conducted to test the 
motivation hypothesis that conjectures that the more students are 
engaged in tutoring, the more tutor learning would be facilitated. 
The students in the Game Show condition were told to obtain the 
highest score in the Game Show, instead of having Tomodachi 
pass the quiz, which was the goal for the students in the non-Game 
Show condition. 

Because the scope of this article does not include the motivation 
hypothesis, we do not discuss details of Study III here. However, 
we include Study III in the following analysis, because the control 
condition of Study III used the same version of SimStudent that 
Study II used for the self-explanation condition. Namely, the Study 
III SimStudent occasionally prompted students for explanations 
and justifications. 


Participants 


There were two schools involved in Study I. One school had 30 
Algebra I (Grade 8) and 34 Algebra II (Grade 9) students, and the 
other school had 40 Algebra I (Grade 8) students. Study II involved
one school with 160 Algebra I students in Grades 8, 9, and
10. Study III was conducted at the same school as Study II, and
141 Algebra I students in Grades 7 and 8 participated in Study III.
To avoid a confounding factor of familiarity with the study, we
excluded the ninth- and 10th-grade students who were likely to have
been included in Study II.

There were a significant number of absentees in each study. For 
the analysis in the following sections, we included only students 
who took all three (pre, post, and delayed) tests and participated in 
all classroom sessions. As a consequence, the following analyses 
contain 33 (32%), 81 (51%), and 69 (49%) of students for Study I, 
II, and III, respectively. 
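
The reported percentages can be checked directly from the enrolment figures. The sketch below assumes that the Study I total is the sum of the three class counts (30 + 34 + 40) and that the 141 students are the Study III enrolment; the variable names are illustrative.

```python
# Enrolment and analyzed-sample counts taken from the Participants section.
enrolled = {"I": 30 + 34 + 40, "II": 160, "III": 141}
analyzed = {"I": 33, "II": 81, "III": 69}

# Share of enrolled students who took all three tests and attended all sessions.
retention = {study: analyzed[study] / enrolled[study] for study in enrolled}

for study, rate in retention.items():
    print(f"Study {study}: {analyzed[study]}/{enrolled[study]} = {rate:.0%}")
```

Under these assumptions the rates round to 32%, 51%, and 49%, in agreement with the values stated above.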


Measures 


Outcome of tutee learning. To quantify tutee learning (i.e.,
SimStudent’s achievement), we use the number of quiz sections
that SimStudent passed; the quiz format differed among the three
studies.

Outcome of tutor learning. Students’ learning was measured 
with online tests that consisted of two parts—the Procedural Skill 
Test (PST) and the Conceptual Knowledge Test (CKT). The tests 
had three isomorphic versions that were counterbalanced for pre-, 
post-, and delayed tests. Two test items were considered isomor- 
phic when they were of identical type, but included different letters 
and numbers. Equations were carefully varied so that two isomor- 
phic equations shared the same properties in their solutions (e.g., 
whole number vs. fraction). 

The PST had three types of test items: (a) six equation-solving
items, for which students were asked to show their work on a piece
of paper; (b) twelve agree/disagree items to determine whether a given
operation was a logical next step for a given equation; and (c) five
worked-out items to identify the incorrect step in a given incorrect
solution (multiple choice) and explain why (free response). The
CKT had two types of test items: (d) thirty-eight true/false items
asking about basic algebra vocabulary to identify constants, variables,
and like terms; and (e) ten true/false items to determine whether
two given expressions are equivalent.

For Studies II and III, the following changes were made on the 
online test: (a) Four additional one-step equations were added to 
the equation-solving items. (b) A “Not Sure” option was added for 
multiple-choice items to lower the chance of students making 
random guesses. Students were told that they would lose a point 
for an incorrect answer for multiple-choice questions, but there 
was no penalty for selecting “Not Sure.” 

The test items were graded as follows. For the equation-solving
items, students received a score of 1 if their answer was correct and
partial credit based on their written work if their answer was
incorrect. For the multiple-choice items, students received a score
of 1 for a correct answer, 0 for “Not Sure,” and −1 for an incorrect
answer.
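
The grading rules can be written down as two small functions. This is a sketch under assumptions: the function names are hypothetical, and because the rubric for partial credit is not described here, the partial-credit value is treated as an input supplied by the grader and assumed to lie between 0 and 1.

```python
def score_multiple_choice(response: str, correct_answer: str) -> int:
    """Score a multiple-choice item: 1 correct, 0 for "Not Sure", -1 incorrect."""
    if response == "Not Sure":
        return 0
    return 1 if response == correct_answer else -1


def score_equation_item(answer_correct: bool, partial_credit: float = 0.0) -> float:
    """Score an equation-solving item.

    partial_credit is the grader's judgment of the written work and is only
    used when the final answer is incorrect (assumed range: 0 to 1).
    """
    if answer_correct:
        return 1.0
    if not 0.0 <= partial_credit <= 1.0:
        raise ValueError("partial_credit must be between 0 and 1")
    return partial_credit


# Usage on a true/false item whose correct answer is "True".
scores = [score_multiple_choice(r, "True") for r in ("True", "Not Sure", "False")]
```

Under this rule a blind guess on a two-option item has an expected score of 0.5 × 1 + 0.5 × (−1) = 0, the same as choosing “Not Sure,” so guessing offers no advantage on average.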


Cognitive and Social Factors of Interest 


APLUS automatically collects detailed data showing the inter- 
action between students and SimStudent with additional narratives 
such as the response correctness. In the current analysis, we focus 
on the following variables: 

1. The accuracy of students’ feedback and hints. The accuracy of 
response is an aggregation of feedback and hints. 

2. The likelihood of responding to SimStudent’s hint request, 
which is the ratio of hints provided by a student to the total number 
of hints requested by SimStudent. Although students must answer 
SimStudent’s hint request to proceed to the next step, they some- 
times avoided answering by starting a new problem or giving a 
quiz. 

3. The frequency of self-explanations submitted by students 
during tutoring. 

4. The type of problems tutored. Although students were ex- 
plicitly told that SimStudent must be able to solve equations with 
variables on both sides to pass the quiz, they needed to start with 
easier types to work up to the target difficulty. 

5. The degree of repetition in selecting problems for tutoring. 
Students in our studies often used quiz problems during tutoring. 
As mentioned before, the problems in the quiz were fixed for 
Study II, but only the type of problems was fixed for Study III. To
avoid confusion, we shall use the term problem to mean the exact
same problem for Study II and the same type of problem for Study
III. The problem repetition ratio is then the ratio of the number of
problems tutored more than once to the total number of problems 
tutored. 

6. Time on task. The amount of time students spent tutoring 
problems and giving explanations to SimStudent. This time does 
not include the quiz or the resource usage. 

7. Tutor’s prior knowledge, that is, each student’s PST and CKT 
pretest scores. 

8. Tutee’s learning outcome, that is, the number of quiz sections 
that SimStudent passed. 
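
Three of the ratio variables in this list can be computed as shown in the following sketch. The function names are hypothetical, and one reading is assumed for the repetition ratio: distinct problems tutored more than once divided by distinct problems tutored. The article's wording would also permit other denominators.

```python
from collections import Counter
from typing import Optional, Sequence


def response_accuracy(correct_flags: Sequence[bool]) -> Optional[float]:
    """Share of feedback and hint responses that were correct."""
    if not correct_flags:
        return None
    return sum(correct_flags) / len(correct_flags)


def hint_response_ratio(hints_provided: int, hints_requested: int) -> Optional[float]:
    """Hints the student provided divided by hints SimStudent requested."""
    if hints_requested == 0:
        return None
    return hints_provided / hints_requested


def problem_repetition_ratio(problems_tutored: Sequence[str]) -> Optional[float]:
    """Distinct problems tutored more than once over distinct problems tutored."""
    counts = Counter(problems_tutored)
    if not counts:
        return None
    repeated = sum(1 for n in counts.values() if n > 1)
    return repeated / len(counts)


# Usage: three distinct problems, one of which was tutored twice.
ratio = problem_repetition_ratio(["3x=6", "3x=6", "2x+1=5", "x+2=3x"])
```

For Study III, where quiz problems were regenerated, the strings passed to `problem_repetition_ratio` would be problem types rather than exact problems.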


Results 


This section is organized to answer the three major research 
questions mentioned in the introduction. We first show results 
about SimStudent’s and students’ learning outcomes addressing 
the first research question. We then show the correlations between 
tutor and tutee learning that answers the second research question. 
Finally, we show major findings obtained from the process data 
showing the cognitive and social factors that have significant 
influence on tutor and tutee learning. 


Learning Outcomes 


Because the three studies were conducted at different schools in 
different years, we first tested whether there was any population 
difference among the three studies. A one-way analysis of variance 
(ANOVA) was conducted with the independent variable of study 
(I, II, III) and the dependent variable of pretest score aggregated
across the two conditions. For both the PST and the CKT, the mean pretest
score for Study I was significantly higher than that for Study II, which was
significantly higher than that for Study III; for PST, F(2, 180) = 23.58,
p < .001; for CKT, F(2, 180) = 44.81, p < .001. The difference
in the pretest scores might reflect the age difference between the
studies: Study III had the youngest student population.

Tutee-learning outcome: Performance on the quiz. Figure 
3 shows the number of students whose SimStudent passed the quiz 
during the intervention. In Study I, none of the 18 students in the 
SimStudent condition managed to get their SimStudent to pass all 
four sections of the quiz. Only five students managed to pass quiz 
Section 1, and of those five, only one student passed quiz Section 
3. For Study II, 36 out of 81 students managed to pass all four
sections of the quiz within the allotted 3 days. Nearly all students
(78 out of 81, i.e., 96%) passed at least Section 1. To our surprise,
in the Study III baseline condition, we again observed that none of
the students managed to have their SimStudent pass the quiz. Only
22 out of 40 (55%) students passed quiz Section 1. As mentioned
earlier, Study III involved younger students and showed lower
pretest scores than the other two studies. The students in Study III
might have been less prepared for tutoring.

Tutor-learning outcome: Test scores. A summary of the test
scores is shown in Table 1. The table shows mean scores for the
pre-, post-, and delayed tests for all three studies. No condition
difference on the pretest was found for either the PST or the CKT
in any of the three studies. We thus conducted a 2 × 3 repeated
measures ANOVA, with condition (study vs. control) as a
between-subjects variable and test time (pre, post, and delayed) as
a within-subjects repeated variable. The ANOVA was run on the
PST and the CKT separately for each study.

Figure 3. Number of students whose SimStudent passed the quiz. I, II,
and III represent each study. Study I did not have a Day 3. The quiz had
four sections with two equation problems per section, which were randomly
ordered for Study I and ordered on the basis of difficulty level for
Studies II and III. The quiz problems were fixed for Studies I and II. For
Study III, the quiz problems were generated on the fly while keeping the
type of problems intact.

For the PST, there was neither a main effect of condition nor
an interaction effect between test time and condition. There
was, however, an iterative enhancement of the effect of tutor
learning, indicated as the gain on the test scores. In Study I, the
impact of test time was absent. In Study II, however, there was
a main effect of test time, F(2, 78) = 34.85, p < .001. A further
analysis showed that students’ average test scores were significantly
higher in the delayed test (M = .68, SD = .27) than in both
the pretest (M = .54, SD = .26; p < .001, d = 0.53) and the
posttest (M = .57, SD = .31; p < .001, d = 0.38). The
difference between the pre- and posttests was not statistically
significant. The reason for the higher delayed test scores in
Study II might not be completely due to the study intervention.
There were algebra classes between the post- and the delayed
test (2 weeks apart) in which regular teachers continued teaching
equation solving. For Study III, again, there was a main
effect of test time, F(2, 66) = 8.81, p < .001. Both posttest (M
= .45, SD = .20; p < .01, d = .35) and delayed test (M = .46,
SD = .23; p < .001, d = .37) were significantly higher than
pretest (M = .38, SD = .20). The difference between the post-
and delayed test was not statistically significant.
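
The effect sizes in this paragraph can be approximately reproduced from the reported means and standard deviations. The sketch assumes that d was computed as the mean difference divided by the root mean square of the two standard deviations; the article does not state which variant of Cohen's d was used, so this is an inference from the numbers, and `cohens_d` is an illustrative name.

```python
from math import sqrt


def cohens_d(mean_a: float, sd_a: float, mean_b: float, sd_b: float) -> float:
    """Mean difference divided by the root mean square of the two SDs."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Study II: delayed test versus pretest, and delayed test versus posttest.
d_ii_delayed_pre = cohens_d(0.68, 0.27, 0.54, 0.26)
d_ii_delayed_post = cohens_d(0.68, 0.27, 0.57, 0.31)
# Study III: posttest versus pretest, and delayed test versus pretest.
d_iii_post_pre = cohens_d(0.45, 0.20, 0.38, 0.20)
d_iii_delayed_pre = cohens_d(0.46, 0.23, 0.38, 0.20)
```

Rounded to two decimals, these four values agree with the reported d = 0.53, 0.38, .35, and .37.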

For the CKT, there was neither a main effect of test time nor a
main effect of condition in any of the three studies.


Correlation Between Tutee and Tutor Learning 


In Study III, tutee learning (the number of quiz sections that
SimStudent passed) was significantly correlated with tutor learning
(the normalized gain on the PST), r(39) = .37, p < .05. There
was no significant correlation between tutee and tutor learning for
Studies I and II.
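
For readers unfamiliar with the measure, the following sketch shows the conventional normalized gain for proportion scores, (post − pre) / (1 − pre). This section does not restate the formula, so treating it as the one used here is an assumption, and `normalized_gain` is an illustrative name.

```python
from typing import Optional


def normalized_gain(pre: float, post: float, max_score: float = 1.0) -> Optional[float]:
    """Gain achieved as a fraction of the gain that was still possible."""
    if pre >= max_score:
        return None  # no room for improvement, so the gain is undefined
    return (post - pre) / (max_score - pre)


# Usage with the Study III mean PST scores (pretest .38, posttest .45).
gain = normalized_gain(0.38, 0.45)
```

Note that the gain of the means computed here need not equal the mean of the individual students' gains, which is what a per-student analysis would use.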


Cognitive and Social Factors for Tutee and Tutor 
Learning 


What affected tutee learning? It is surprising to observe that
so many students failed to sufficiently tutor SimStudent to pass the
quiz. To understand why, we conducted comparative analyses by
splitting students into two groups based on the median quiz progress.
For this analysis, we included students from both conditions
in Study II (N = 81) and those in the control condition in Study III,
in which the goal of tutoring was to have SimStudent pass the quiz
(N = 40). Study I students were excluded from this analysis,
because the order of quiz items in Study I was not compatible with
Studies II and III.

Students were split into the successful group and the unsuc- 
cessful group using the median of the quiz section passed. For 
Study II, the split occurred at Section 3 (successful n = 44 vs. 
unsuccessful n = 37), whereas for Study III, the split occurred 
at Section 1 (successful n = 21 vs. unsuccessful n = 18). 
Within each study, we compared the two groups for a number
of factors using independent samples t tests. Table 2 shows the
results of this analysis.
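
The analysis procedure can be sketched as a median split followed by a two-sample t test. The sketch makes two assumptions that the text does not confirm: students at or above the median count as successful, and the unequal-variance (Welch) form of the test is shown, which corresponds to the adjusted degrees of freedom reported in the notes to Table 2. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, median, variance
from typing import List, Sequence, Tuple


def median_split(sections_passed: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return indices of (successful, unsuccessful) students."""
    cut = median(sections_passed)
    successful = [i for i, s in enumerate(sections_passed) if s >= cut]
    unsuccessful = [i for i, s in enumerate(sections_passed) if s < cut]
    return successful, unsuccessful


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t, df) for an independent samples t test with unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Usage with made-up data (not the study data).
groups = median_split([0, 1, 3, 4, 2, 3])
t_stat, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

When the variances of the two groups are similar, the pooled-variance test with n1 + n2 − 2 degrees of freedom would be the usual choice instead.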





Table 1
Test Scores Summary

                       Pre-test                  Post-test                Delayed-test
Study  Test   CogTutor     Baseline     CogTutor     Baseline     CogTutor     Baseline
I      P      .67 (.24)    .74 (.19)    .73 (.20)    .76 (.21)    .65 (.25)    [illegible]
       C      [illegible]  .62 (.15)    .57 (.10)    .63 (.16)    .52 (.19)    .58 (.13)

              Baseline     SelfExpl     Baseline     SelfExpl     Baseline     SelfExpl
II     P      .54 (.27)    .52 (.26)    [illegible]  .57 (.33)    .68 (.25)    .68 (.30)
       C      .29 (.23)    .29 (.27)    .32 (.24)    [illegible]  .30 (.25)    .35 (.26)

              SelfExpl     Game Show    SelfExpl     Game Show    SelfExpl     Game Show
III    P      .34 (.19)    .44 (.19)    .41 (.21)    .49 (.19)    .43 (.23)    .50 (.23)
       C      .13 (.20)    .19 (.20)    [illegible]  .21 (.20)    [illegible]  .22 (.20)

Note. CogTutor = cognitive tutor; Baseline = the baseline Artificial Peer Learning environment Using SimStudent (APLUS) and SimStudent; P =
Procedural Skill Test; C = Conceptual Knowledge Test; SelfExpl = APLUS and SimStudent with self-explanation prompt; Game Show = APLUS and
SimStudent with the Game Show feature. Each cell shows the mean, with the standard deviation in parentheses.
Table 2
Comparison Between Successful and Unsuccessful Groups Based on a Median Split for Number of Quiz Sections Passed

                                        Quiz performance
Factor                      Study  Successful   Unsuccessful  t            Cohen’s d  df
Procedural Normalized Gain  II     [illegible]  .03 (.51)     −1.01        0.23       [illegible]
                            III    [illegible]  −.01 (.29)    −2.80**      0.92       37
Correct Feedback            II     .86 (.07)    .83 (.08)     [illegible]  0.34       79
                            III    .78 (.06)    .68 (.14)     2.80         1.21       21ᵃ
Correct Hint                II     [illegible]  .63 (.15)     [illegible]  0.83       79
                            III    .56 (.13)    .30 (.25)     3.05**       1.61       24ᵇ
Disregarding Hint Requests  II     .44 (.16)    .52 (.20)     1.86         0.41       79
                            III    .16 (.09)    .29 (.16)     3.14**       1.03       [illegible]
Repeating Problem Element   II     [illegible]  .23 (.19)     2.86**       0.72       62ᶜ
                            III    .38 (.12)    .46 (.19)     [illegible]  0.57       38

Note. Standard deviations appear in parentheses. Maximum df is 79 for Study II and 38 for Study III. Numbers marked with
[symbol illegible in the source] had 79 − N or 38 − N cases removed on the basis of a z-score outlier analysis of ±3.
ᵃ Levene’s test indicated unequal variances (F = 10.07, p < .01); df adjusted from 38 to 21. ᵇ Levene’s test indicated unequal
variances (F = 12.68, p < .001); df adjusted from 38 to 24. ᶜ Levene’s test indicated unequal variances (F = 9.48, p < .01); df
adjusted from 79 to 62.
**p < .01. ***p < .001.


First, the correctness of the student’s feedback and hints had a
notable influence on SimStudent’s learning. For Study II, students
in the successful group provided correct hints more often than
students in the unsuccessful group. There was, however, no group
difference in the accuracy of the feedback provided. For Study III,
students in the successful group provided both correct hints and
accurate feedback more often than students in the unsuccessful
group.

Second, the likelihood of responding to SimStudent’s hint
request also differed notably. For Study III, the successful
students responded to hint requests more often than
unsuccessful students. The likelihood was, however, not significantly
different between the successful and unsuccessful groups for
Study II.

Third, the problem repetition ratio was different. For Study II,
the successful group tended to repeat the exact same problem less
often than the unsuccessful group. For Study III, however, the
difference was only marginal.

What affected tutor learning? In this analysis, we use the
normalized gain of the PST from the pre- to the posttest as the
measurement for the tutor learning. This analysis includes the
same student data used in the tutee-learning analysis mentioned
in the previous section.

First, the more the target problems were tutored (i.e., equations
with variables on both sides), the more the students learned. This
correlation was observed in both Study II, r(79) = .26, p < .05,
and Study III, r(39) = .36, p < .05.

Second, the more the students gave self-explanations on the
target problems, the more the students learned. Again, this correlation
was observed in both Study II, r(38) = .32, p < .05, and
Study III, r(37) = .33, p < .05.

Third, the more the students provided correct tutoring responses
(a combination of feedback and hint), the more the students
learned, although this correlation was observed only in Study
III, r(38) = .31, p < .05.

To our surprise, there was no correlation between the time on
task and tutor learning in either study: Study II, r(81) = .09, p =
.40; Study III, r(40) = .08, p = .61.


Impact of Prior Knowledge for Tutor and 
Tutee Learning 


We first show the impact of the tutee’s prior knowledge on tutor
learning. SimStudents for Study II (Stacy) and Study III (Tomodachi)
were equally pretrained on more one-step equations than the
SimStudent in Study I (Lucy). An independent samples t test
confirmed that both Stacy and Tomodachi performed better on the
first three tutoring problems than Lucy.

To see how the tutee’s performance affected the tutor’s
performance, students’ response accuracy was computed as a
ratio of correct responses (i.e., feedback or hint) to all responses
for each step in the first three tutored problems. On average,
students in Studies II and III gave accurate responses more
often than Study I students. Students in Studies II and III
showed an average response accuracy of .76 (SD = .21),
whereas students in Study I showed an average response accuracy
of .57 (SD = .26). The difference is statistically significant,
t(134) = −3.49, p < .001, d = 0.60.

One possible explanation for students’ higher response accuracy
in Studies II and III is that it is easier to recognize correct steps as
correct than to identify incorrect steps as incorrect. Because Stacy
and Tomodachi performed more steps correctly, the students in
Studies II and III were able to correctly provide positive feedback
more easily. When aggregated across all three studies, SimStudent’s
performance accuracy and students’ response accuracy were
actually highly correlated, r(135) = .69, p < .001.

Next, we analyzed the impact of the tutor’s prior knowledge on
tutor learning. The PST and CKT pretest scores were both predictive of
students’ posttest scores on the PST. A regression analysis with
PST and CKT pretest scores as independent variables and the PST
posttest score as a dependent variable revealed the following
regression coefficients: PST_Post = .70 × PST_Pre + .12 ×
CKT_Pre + .17. An identical analysis was also conducted for the
CKT posttest; a regression analysis with the PST and CKT pretest
scores as independent variables and the CKT posttest score as a
dependent variable revealed the following regression coefficients:
CKT_Post = .25 × PST_Pre + [coefficient illegible in the source] × CKT_Pre + .05.
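
As a worked illustration, the reported PST regression can be applied to a pair of pretest scores. The function name `predict_pst_post` is hypothetical, the coefficients are those printed for the PST posttest equation (.70, .12, and an intercept of .17), and the inputs are the Study II baseline pretest means from Table 1, used here only as example values.

```python
def predict_pst_post(pst_pre: float, ckt_pre: float) -> float:
    """Predicted PST posttest score from the reported regression coefficients."""
    return 0.70 * pst_pre + 0.12 * ckt_pre + 0.17


# Example inputs: PST pretest .54 and CKT pretest .29.
predicted = predict_pst_post(0.54, 0.29)
```

The arithmetic is 0.70 × 0.54 + 0.12 × 0.29 + 0.17 = 0.378 + 0.0348 + 0.17 = 0.5828, which shows that the procedural pretest score carries most of the weight in the prediction.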
Discussion


Results from the experiment provide four sets of important 
information to understand tutor learning. First, our data show that 
learning by teaching SimStudent is effective for learning proce- 
dural skills measured by the PST, as shown in Study III, but not for 
learning conceptual knowledge measured by the CKT. 

Second, there is a significant correlation between tutee and tutor 
learning. Students tended to learn more when they tutored Sim- 
Student correctly (i.e., with an accurate response) and appropri- 
ately (i.e., on appropriate problems with a sufficient amount of 
explanations). 

Third, there were some notable differences in the way that the 
successful and the unsuccessful groups tutored SimStudent. Stu- 
dents in the unsuccessful group had trouble teaching SimStudent 
well, perhaps without even recognizing that they were not teaching 
appropriately. This manifested itself in students making many of 
the same errors, not properly responding to SimStudent’s hint 
requests, and repeatedly teaching the same problem. 

Fourth, both the tutee’s and the tutor’s prior knowledge affected tutor
learning. When the tutee had higher prior knowledge, the tutor 
tended to respond more accurately, which was further correlated 
with tutor learning. Our data also showed, however, that the tutor’s 
prior competence both on conceptual and procedural knowledge 
was strongly predictive of tutor learning. 

Finally, both SimStudent and APLUS have been iteratively 
improved from Study I to Study III, which may explain the gradual 
enhancement of the outcome. There was a population difference in 
the pretest score. Both for the PST and the CKT, students in Study 
I scored higher than the students in Study II, who outperformed the 
students in Study III. Yet, only Study III showed a significant gain 
in PST scores from pre- to posttest. 


Tutor Help 


Our findings show that learning by teaching does not happen 
automatically. Students need help to tutor SimStudent correctly 
and appropriately. Other research has also pointed out that students 
often do not correctly recognize their own misunderstandings 
(King, 1998). In our studies, students often unknowingly made 
inappropriate tutoring decisions and provided incorrect feedback 
and hints, which affected SimStudent’s learning. These behaviors 
were negatively correlated with tutor learning. 

One idea to provide such tutor help is to integrate a third agent 
(a meta-tutor) into the APLUS environment, a commonly used 
idea in the context of multiagent learning systems (e.g., Biswas et 
al., 2005; Vassileva, McCalla, & Greer, 2003). The meta-tutor 
oversees students’ tutoring activities and provides them with just- 
in-time scaffolding. 

The meta-tutor could provide students with both cognitive help 
regarding domain knowledge about how to solve problems and 
metacognitive help regarding proper tutoring methods. Some stud-
ies show that tutors can be trained to be better tutors, which
facilitates tutor learning (Ismail & Alexander, 2005; King et al., 
1998). Other studies show the effect of the tutor help (Biswas et 
al., 2010; Walker et al., 2009), but none of them have explored the 
differing effects of cognitive and metacognitive help. It is therefore 
important to study how to implement cognitive and metacognitive 
help, how they foster tutor learning, and how well students learn 
tutoring skills from these different types of interactive support. 

Another possibility is for the teachable agent itself to provide
tutor help. For example, if a student poses the same problem (or 
the same type of problem) multiple times, then SimStudent could 
alert the tutor. The difference in the source of tutor help would 
have different social and affective impacts on the student. This 
might become particularly subtle when the student has established 
a different rapport with SimStudent and the meta-tutor. Studying 
the social factors of tutor help would therefore be important (Ogan, 
Finkelstein, Mayfield, D’Adamo, Matsuda, & Cassell, 2012).


The Effect of Self-Explanation for Tutor Learning 


The current data show that the tutor-learning effect in APLUS is 
limited to procedural skills. Further studies will be needed to 
investigate the tutor-learning effect on conceptual knowledge. We 
hypothesized that self-explanations would facilitate learning con- 
ceptual knowledge, because good explanations contain conceptual 
justifications for algebraic operations. However, the students 
sometimes provided shallow responses (e.g., “Because you didn’t
add right”) or irrelevant responses (e.g., “Because I just did”). The
current version of SimStudent does not parse students’ responses; 
instead, it simply proceeds to the next step. Empirical studies show 
that the tutee’s questions have substantial influence on tutor learn- 
ing (Roscoe & Chi, 2004). Thus, if SimStudent requested elabo- 
ration or further reflection on a given response, it may facilitate 
tutor learning. This kind of question is called a reflective 
knowledge-building question, and its effect has been well re- 
searched (Roscoe & Chi, 2007). Building such an intelligent 
teachable agent is therefore an important direction for future 
research (Carlson, Keiser, Matsuda, Koedinger, & Rose, 2012). 


Learning by Teaching Versus Cognitive Tutoring 


Our data show similarities and differences between learning by 
teaching and learning by cognitive tutoring (i.e., more direct in- 
struction). The effect of self-explanation, for example, was evident 
for both styles of learning. Possession of prerequisite knowledge 
also has a notable influence on both styles of learning (Booth & 
Koedinger, 2008). 

A notable difference between the two learning styles is the 
degree to which students can practice metacognitive skills. Our 
data show that the accuracy of tutoring responses, the frequency of 
self-explanations, and the type of problems tutored all positively 
correlate with tutor learning. To achieve successful learning, stu- 
dents must simultaneously monitor both their tutee’s performances 
and their own. This double-edged monitoring requires more com- 
plicated metacognitive skills than solving problems alone in the 
context of cognitive tutoring. 

There is also a difference in the timing of feedback. The feed- 
back in the context of APLUS, that is, the system’s reaction to the 
correctness of the student’s tutoring activities, is delayed. The 
current version of APLUS does not provide students with any 
explicit feedback on their tutoring activities. Students later notice 
when they have made mistakes by reviewing the quiz summary or 
by observing SimStudent’s undesired behaviors during tutoring. 
As an example of the latter, even when a student
incorrectly demonstrated “subtract 4” for “3x − 4 = 10” with a
correct intention to isolate the “3x” on the left-hand side, SimStu-
dent might correctly suggest entering “3x − 8” for the left-hand
side of the new equation, instead of “3x,” which is what the student
expected to see.

The gap between the student’s expectations and SimStudent’s 
actual performance might motivate students to reflect on their 
tutoring actions. This is a kind of “intelligent novice” model of 
desired performance (Mathan & Koedinger, 2005) for tutor learn- 
ing. Embedding the above-mentioned tutor help into the model of 
desired tutor performance might thus facilitate tutor learning. 
Observing the tutee’s performance is a distinctive form of learning 
from correct and incorrect examples available in learning by 
teaching. 


Conclusion 


Students learn by teaching others. Our data show that students 
learn by teaching primarily when they teach the target skills 
correctly and appropriately. The accuracy of students’ responses 
(i.e., feedback and hints), the quality of students’ explanations 
during tutoring, and the appropriateness of tutoring strategy (i.e., 
problem selection) all affected SimStudent’s learning outcome, 
which further affected students’ learning. 

Students’ prior knowledge has a strong influence on tutor learn- 
ing. If students are not well prepared to tutor, the benefits of tutor 
learning might be reduced. Alternatively, once students become 
domain experts and can solve problems fluently (hence become 
better teachers), the benefit of tutor learning might also decline. 
Tutor learning is essentially a paradoxical phenomenon whose 
mechanisms have yet to be fully elucidated. 

Students make errors when teaching and get stuck when pro- 
viding hints, both of which are detrimental for tutor learning. 
Providing more tutor help in the form of cognitive and/or meta- 
cognitive support may be critical to optimizing tutor learning. 

The competence of the tutee also affects tutor learning.
Carefully designing SimStudent’s learning ability and adaptively 
assigning an optimized SimStudent on the basis of the student’s 
competency would further provide us with insight into successful 
learning by teaching. 


Gendered Socialization With an Embodied Agent: Creating a Social and
Affable Mathematics Learning Environment for Middle-Grade Females


Yanghee Kim Jae Hoon Lim 


Utah State University 


This study examined whether or not embodied-agent-based learning would help middle-grade 
females have more positive mathematics learning experiences. The study used an explanatory mixed 
methods research design. First, a classroom-based experiment was conducted with one hundred 
twenty 9th graders learning introductory algebra (53% male and 47% female; 51% Caucasian and 
49% Latino). The results revealed that learner gender was a significant factor in the learners’ 
evaluations of their agent (η² = .07), the learners’ task-specific attitudes (η² = .05), and their
task-specific self-efficacy (η² = .06). In-depth interviews were then conducted with 22 students
selected from the experiment participants. The interviews revealed that Latina and Caucasian 
females built a different type of relationship with their agent and reported more positive learning 
experiences as compared with Caucasian males. The females’ favorable view of the agent-based 
learning was largely influenced by their everyday classroom experiences, implying that students’ 
learning experience in real and virtual spaces was interconnected. 


Keywords: embodied agents, interactive learning environments, equity in mathematics education, 

A recent analysis of the National Assessment of Educational
Progress data reported that the achievement gap between Cau- 
casians and ethnic minority students (e.g., African Americans 
and Latinos) in mathematics achievement has become stagnant 
during the last two decades (Vanneman, Hamilton, Anderson, & 
Rahman, 2009). Female students, despite their improved 
achievement in mathematics (Lindberg, Hyde, Petersen, & 
Linn, 2010), still report lower interest and lower self- 
confidence in mathematics as compared with males (Jacobs, 
Davis-Kean, Bleeker, Eccles, & Malanchuk, 2005). These un- 
derrepresented groups of students often “disidentify” them- 
selves with mathematics learning (Steele, Spencer, & Aronson, 
2002) and, as a result, are more likely to avoid taking advanced 
mathematics classes (Steffens, Jelenec, & Noack, 2010). Ac- 
knowledging the urgency in resolving these problems, the Na- 
tional Science Board (2010) has declared its commitment to 
equity and diversity as a focal area for developing the next 
generation of science, technology, engineering, and mathemat- 
ics (STEM) innovators. 

A variety of social, cultural, and economic factors might lead 
to the equity issues. However, gender and ethnic inequity in 
mathematics education is often attributed to the unsupportive
learning context in schools (Moody, 2004) and undesirable 
social influences such as stereotyping (Steele et al., 2002). 
Females and ethnic minorities often lack the instructional sup- 
port that might motivate them to engage and succeed in the area. 
This lack of support, coupled with social stereotyping, leads 
them to hold a negative view of mathematics and to doubt their 
capability to succeed. 

Reshaping the school context and social influences might be 
a long societal process, requiring synergistic endeavors by a 
multitude of individuals and institutions. Nonetheless, ad- 
vanced learning technology might design supportive learning 
contexts that help close these motivational and achievement 
gaps. One such technology that uses animated digital characters 
(called embodied interface agents) promises to augment the 
bandwidth of a learner’s interactions with computers (Bailenson 
et al., 2008) and to add social richness to the interactions 
(Iacobelli & Cassell, 2007). Many females’ and ethnic minor-
ities’ learning styles favor active and multifaceted interactions
(Sciarra & Seirup, 2008); connectedness and relationships are 
characteristic of their learning process (Crosnoe et al., 2010). If 
designed carefully, agent-based learning might be able to create 
a favorable learning context for these students, accommodating 
their learning styles and characteristics. 

This study was conducted to examine this expectation that the 
females’ and minorities’ affect and learning would improve in a 
more social and affable agent-based environment. The study,
consisting of a classroom experiment and follow-up in-depth
interviews, investigated how middle-grade students learning
introductory algebra would react to an agent and whether the reactions
would differ by the students’ gender and ethnicity. 




Theoretical Background 


Sociocultural Aspect of Mathematics Learning 


The sociocultural context of learning plays a significant role in 
shaping students’ motivation, learning behaviors, and academic 
outcomes in schools. The learning process is not merely a cogni- 
tive restructuring within an individual mind. It is a social and 
cultural process in which multiple facets of human development 
(e.g., identity and emotion) are intertwined with social, cultural, 
and historical forces (Nasir, Rosebery, Warren, & Lee, 2006). For 
example, a student’s “sense of belonging” in school positively 
correlates with her strong and clear identification with the goal of 
schooling, which ultimately leads her to full, active participation in 
all aspects of the learning process (Freeman, Anderman, & Jensen, 
2007). 

The mathematics learning context and its social and instruc- 
tional dynamics play a critical role in motivating all students to 
learn and excel in mathematics. However, the context and dynam- 
ics seem to have even more critical influence on traditionally 
underrepresented groups of students (Geist, 2010). Feminist schol- 
ars argue that females’ unique way of learning is not best sup- 
ported by the traditional mathematics classrooms (Boaler, 2002). 
Females are “connected knowers” and tend to rely on interpersonal 
relationships and commonality of experience when they approach 
a new idea or knowledge (Belenky, Clinchy, Goldberger, & Tarule,
1997). Supportive relationships with instructional authority and 
peers might be critical for many females’ intellectual pursuit of 
mathematics and perseverance in the area (Crosnoe, Riegle- 
Crumb, Field, Frank, & Muller, 2008). However, the mathematics 
education community has a long tradition that views mathematics 
learning as a depersonalized activity disconnected from other 
aspects of students’ everyday lives (Cobb & Yackel, 1998). This 
assumption about mathematics learning disregards the typical style 
of female learning. Not surprisingly, many females experience 
higher anxiety and discomfort in mathematics classes than boys 
(Geist, 2010). These females report lower interest and self-efficacy 
even when their performances are equal to or better than boys’ 
during early school years (Lindberg et al., 2010). As a result, the 
females avoid taking advanced mathematics courses in high school 
(Steffens et al., 2010). 

A similar phenomenon is observed among many Latino stu- 
dents. Three types of engagement influence Latinos’ achievement 
in mathematics: cognitive, emotional, and behavioral. Latinos 
show a higher level of engagement in mathematics when they are 
asked to work with peers than when asked to work alone (Uekawa, 
Borman, & Lee, 2007). They are more likely to use a participatory 
communication style, which requires active response from the 
audience, such as verbal encouragement or even physical move- 
ment during speech (Gay, 2000). This form of communication is 
not readily accepted in conventional mathematics classrooms. 
Rather, it is often viewed as a disruptive behavior or, at best, an 
attitude less effective for learning (Neal, McCray, Webb-Johnson, 
& Bridgest, 2003). Not surprisingly, Latino students experience a 
higher level of mathematics anxiety than Caucasian students; 
Latinas’ anxiety tends to be even worse than that of their male 
counterparts (Willig, Harnisch, Hill, & Maehr, 1983). 


Embodied Agents to Create a Social and Affable Context


Although computers are often regarded merely as a tool to 
perform tasks, computer users actually tend to expect computers to 
be like social entities (Lee & Nass, 2003). In response to animated 
digital characters, users build humanlike relationships with the 
character (Bickmore, 2003); college students expect a digital char- 
acter acting as a tutor to have a nice personality as well as content 
expertise (Kim, 2007). Furthermore, just as girls’ and boys’ pref- 
erences for instructional content, activities, and methods are dif- 
ferentiated in classrooms, so are their reactions to the features in 
computer-based learning (Kinzie & Joseph, 2008). Females’ incli- 
nations toward interactions and relationship building in classrooms 
are consistently demonstrated in computer-based environments. 
For instance, girls like interactive and dynamic hints from the 
computer more than do boys (Arroyo, Murray, Woolf, & Beal, 
2003). 

Researchers in educational technology have explored the use of 
embodied interface agents in various theoretical and practical 
frameworks, for example, in the framework of computer-supported 
collaboration (White, Shimoda, & Frederiksen, 1999), or as a way 
to render a sense of social presence (Graesser, Chipman, Haynes, 
& Olney, 2005; Moreno & Flowerday, 2006). Embodied agents 
even seem to play a persuasive role in shaping viewpoints, atti- 
tudes, and behaviors. One experiment revealed that an agent’s 
pedagogical perspectives were successfully projected into college 
students’ own pedagogical perspectives. Preservice teachers who 
worked with an agent who took a constructivist perspective ad-
opted the constructivist perspective after their interactions with the
agent, whereas those who worked with an agent taking an objec- 
tivist perspective adopted the objectivist perspective (Baylor, 
2002). In another study, middle-school students who had received 
instructions from an agent reported lower levels of perceived 
difficulty than did the students who had received textual informa- 
tion without an agent (Atkinson, 2002). Also, when kindergarten 
children played with the virtual peer Sam, they listened to Sam’s 
stories very carefully and, afterward, mimicked Sam’s linguistic 
styles (Ryokai, Vaucelle, & Cassell, 2003). 

Traditionally, human one-on-one tutoring has been considered 
the best form of instruction because it increased learning by two 
standard deviations as compared with the group instruction in a 
classroom (Bloom, 1984). Researchers in computer-based tutoring 
have strived to approximate the effect of human tutoring. As a 
result, successful tutoring systems were able to increase learning 
by one standard deviation higher than the control groups (Graesser 
et al., 2005; Koedinger & Anderson, 1997). This success has raised 
the inquiry into how we can further make up the missing one 
standard deviation effect through our design. Stone (1998) identi- 
fied three components of scaffolding—perceptual, cognitive, and 
affective—as necessary for effective learning and motivation. 
Many conventional tutoring environments, however, have focused 
on assisting learners in only the cognitive processes of learning. 
They often neglected to implement perceptual and affective as- 
pects of scaffolding. Recently, researchers in educational technol- 
ogy have come to better understand the integral role of human 
cognition and affect in the learning process, and have made efforts 
to equip tutoring environments with affective capabilities 
(D’Mello, Craig, Witherspoon, McDaniel, & Graesser, 2008; du
Boulay et al., 2010). 

These trends in embodied agents and tutoring systems provide
a further line of inquiry into the potential of embodied agent 
technology for addressing the equity issues in mathematics edu- 
cation. When teachers and parents simply presented the facts about 
the nonexistence of gender difference in mathematics learning, 
adolescent females became able to resist negative stereotypes 
concerning girls and mathematics (Jacobs et al., 2005). By pre- 
senting similar messages in the course of instructional guidance, an 
embodied agent tutor might inculcate positive attitudes toward 
mathematics learning and improve females’ self-efficacy beliefs. If 
this expectation turns out to be true, embodied agent technology 
will be able to expand the functionality of conventional tutoring 
systems, which have typically assumed a motivated learner instead 
of generating motivation (du Boulay et al., 2010). 

In this study, we conducted two phases of empirical inquiry. The 
first phase was an in vivo experiment, in which quantitative data 
were collected in natural classrooms. In the second phase, in-depth 
interviews were conducted to better understand the nature of 
students’ learning experiences with their agent. The guiding re- 
search question was: Will middle-grade females’ and Latinos’ 
reactions to an embodied agent be qualitatively different from 
Caucasian males’ reactions? 


Classroom Experiment 


Hypothesis 


In this classroom-based experiment, we investigated whether or 
not learner gender and ethnicity would influence learners’ evalu- 
ations of their agent, mathematics attitudes, mathematics self- 
efficacy, and learning gains. We tested four hypotheses: (a) Fe- 
males and Latino students would evaluate their agent more 
positively than Caucasian males; (b) females’ and Latinos’ atti- 
tudes toward learning mathematics from their agent would be more 
positive than Caucasian males’ attitudes; (c) females’ and Latinos’ 
self-efficacy in learning mathematics from their agent would be 
higher than Caucasian males’ self-efficacy; and (d) females and 
Latinos would increase their learning similarly to Caucasian males
after the intervention. In addition, if an embodied agent would 
have a positive influence on females and Latinos, theoretically, 
Latinas would be the group benefiting most from the agent; Cau- 
casian males would benefit least. The two groups were compared 
in each of the dependent measures. 


Method 


Participants. Participants were one hundred twenty 9th grad- 
ers enrolled in Algebra I classes in two inner-city high schools in 
a mountain-west state in the United States. Sixty-four students 
were male (53%) and fifty-six were female (47%). Sixty-one 
students were Caucasian (51%) and fifty-nine Latino (49%). In the 
participating school districts, students were able to start taking 
Algebra I in the seventh grade and required to complete it by the 
ninth grade. Thus, the participants who had delayed the course 
until required were assumed to be less interested in mathematics 
than the rest of the ninth graders in the schools. The average age 
of the participants was 15.93 (SD = 0.87). 

Intervention: An agent-based algebra-learning environment.
The intervention was two computer-based lessons, integrated with 
an embodied agent. The lessons were designed as supplemental 
materials to classroom learning, in which a learner reviewed the 
concepts individually that he or she had learned from the teacher 
and practiced solving problems to master the concepts. The agent, 
designed as a tutor, presented curriculum-related information and 
feedback and also verbally encouraged the learner to persist in the
task. The learning environment was self-contained, within which 
the learner typed in demographic information to log in, performed 
the learning task, and took pre- and posttests. 

Curricular content. Following the Principles and Standards 
of the National Council of Teachers of Mathematics
(http://www.nctm.org), the curriculum was developed in collaboration
with the algebra teachers in the participating schools, addressing 
their classroom needs. The two lessons, each taking one class 
period (approximately 50 min), dealt with combining like terms 
and distributive properties (Lesson 1) and graphing linear equa- 
tions using slope and y-intercept (Lesson 2). The lessons consisted 
of four to five sections, each section including two phases: (a) 
Review of Concepts and (b) Problem Practice. In the Review, the 
agent presented brief overviews of key concepts and examples. In 
the Problem Practice, a learner solved problems one at a time by 
way of drill-and-practice, listening to the agent’s feedback. The 
lessons were prescripted so that every learner could be exposed to 
all overviews and solve the same number of problems. The teach- 
ers helped identify the errors that students typically made in the 
classroom and helped write corrective feedback messages. Figure 
1 presents example screens of the lesson environment. 

Agent design. The design goal for the embodied agent named 
Chris was to simulate the instructional, social, and empathetic 
roles that might be played by an effective human tutor. We 
achieved the goal by including three features in agent design: (a) 
personalized instructions, (b) social and empathic rhetoric, and (c) 
peerlike image and voice. Regarding personalized instructions, 
while a learner worked individually at his or her own pace, Chris 
used the personal pronoun we in its explanations and feedback, 
emphasizing “a sense of togetherness.” The problem for the learner
to solve was not his or hers but “our problem.” For social and 
empathic rhetoric, Chris used two types of messages in addition to 
curricular overviews and feedback: motivational and persuasive. 
Motivational messages were words of praise and verbal encour- 
agement presented when the learner made a mistake. Persuasive 
messages were statements about the benefits or advantages of 
doing mathematics well. The persuasive messages were integrated 
into the introductions to new sections and subsections so that every 
learner would hear persuasive statements. To promote agent–learner
affinity, the messages adopted the teenagers’ style of
speech. Two high school students translated the messages devel- 
oped by the design team into such teen-friendly speech. The 
Appendix presents examples of the agent messages. Lastly, we 
used peerlike image and voice to increase a sense of affinity. To 
control for the confounding effect by learners’ biases toward agent 
gender or ethnicity (Kim & Wei, 2011), we developed four ver- 
sions of an agent to match the students’ gender and ethnicity. One 
of the four agents was randomly assigned to a student. Also, to 
control for the confounding effect by agent appearance (Gulz & 
Haake, 2006), we morphed the four versions from one base image. 
Following that, we validated the agent images with another group 
of 200 high school students, which confirmed that the images
looked exactly as intended. Agent voices were recorded by four 
adolescent voice actors, in consideration of the friendliness of the 
human voice as compared with a synthesized one (Mayer, Sobko, 
& Mautone, 2003). Lastly, we added facial expressions, eye gaze, 
and head nodding to make the agents look more natural and 
believable. 

Variables and measures. Independent variables included 
learner gender and ethnicity, each having two levels: male Cau- 
casians (35), female Caucasians (26), Latinos (29), and Latinas 
(30). Dependent measures were learners’ evaluations of an agent, 
mathematics attitudes, mathematics self-efficacy, and learning 
gains. 
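
The cell sizes listed above can be checked against the participant totals reported in the Participants section. The sketch below simply sums the four cells of the gender by ethnicity design; the variable and function names are illustrative.

```python
# Cell sizes as reported in the text: (gender, ethnicity) -> number of learners.
cells = {
    ("male", "Caucasian"): 35,
    ("female", "Caucasian"): 26,
    ("male", "Latino"): 29,
    ("female", "Latino"): 30,
}

def marginal(cells, position, level):
    """Sum the cell counts whose key has `level` at index `position`
    (0 = gender, 1 = ethnicity)."""
    return sum(n for key, n in cells.items() if key[position] == level)

total = sum(cells.values())
males = marginal(cells, 0, "male")
females = marginal(cells, 0, "female")
caucasians = marginal(cells, 1, "Caucasian")
latinos = marginal(cells, 1, "Latino")

print(total, males, females, caucasians, latinos)  # 120 64 56 61 59
print(round(100 * males / total), round(100 * caucasians / total))  # 53 51
```

The marginal totals match the reported 64 males, 56 females, 61 Caucasian students, and 59 Latino students.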

Learners’ evaluations of an agent. Learners’ evaluations of 
their agent were measured with a 17-item questionnaire using a 
Likert scale ranging from 1 (Strongly disagree) to 7 (Strongly 
agree). The items asked whether the agent was friendly and helpful 
for learning and whether the learner desired to work with the agent 
again (e.g., “Chris was friendly,” “Chris was easy to understand,” 
and “I’d like to learn from Chris again”). Interitem reliability was
evaluated as α = .96.
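
Interitem reliability of this kind is commonly computed as Cronbach's coefficient alpha. The sketch below shows that calculation under the assumption that this is the coefficient the authors report. The response data are invented for illustration and are not the study's data.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores)),
    where k is the number of items. Sample variances are used throughout.
    """
    k = len(responses[0])
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    total_var = variance([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings on a 1-7 Likert scale (4 respondents, 3 items).
ratings = [
    [6, 7, 6],
    [3, 4, 3],
    [5, 5, 6],
    [2, 2, 3],
]
print(round(cronbach_alpha(ratings), 2))
```

When every respondent gives the same rating to all items, the items are perfectly consistent and the function returns 1.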

Mathematics attitudes. Mathematics attitudes were defined as 
learners’ overall evaluative responses to learning mathematics 
(Petty, DeSteno, & Rucker, 2001). Pre- and posttest items, scaled 
from 1 (Strongly disagree) to 7 (Strongly agree), were derived 
from the Attitudes Toward Mathematics Inventory
(http://www.rapidintellect.com/AEQweb/cho25344l.htm). The five-item pre-
test measured learners’ general attitudes toward learning mathe- 
matics (e.g., “In general, I like learning math”). The pretest was 
used as a covariate in the analysis; interitem reliability was evalu-
ated as α = .80. The posttest included two categories
of attitudes. One category measured learners’ general attitudes
(same as the pretest); the other measured learners’ attitudes spe-
cifically toward learning mathematics from the agent in the lessons
(two items; e.g., “I liked solving math problems with
Chris in this lesson”). Posttest interitem reliability was evaluated as
α = .84.

Figure 1. The example screens of the agent-based learning environment
(panel title: Lesson 2: Graphing Linear Equations).


Mathematics self-efficacy. Mathematics self-efficacy was de- 
fined as learners’ beliefs in their capability to successfully learn 
mathematics (Bandura, 1997). Following Bandura’s (2006) guide- 
lines, pre- and posttest items were developed and ranged from 1 
(Strongly disagree) to 7 (Strongly agree). The five-item pretest 
measured the learners’ general self-efficacy beliefs in learning 
mathematics (e.g., “In general, I am confident in learning math”).
The pretest was used as a covariate; interitem reliability was
evaluated as α = .84. The posttest included two categories of
self-efficacy. One category measured learners’ general self-
efficacy (same as the pretest); the other category measured their
self-efficacy specifically in learning mathematics from the agent
(four items; e.g., “I was confident in solving problems
with Chris in this lesson”). Posttest interitem reliability was eval-
uated as α = .86.

Learning gains. Learning was measured with a pretest and an 
immediate posttest. After logging into the system, the learners 
solved 16 problems; at the end of the lesson, they solved another 
set of 16 equivalent problems. For example, one item in pretest 
asked the learners to distribute the expression 3a(x + y); the 
matching posttest item asked to distribute the expression 5x(a + 
b). The items were presented one after another; the format was 
similar to Figure 1, Question 4 on the left, without agent presence. 
Students used scratch paper and pencil to solve a problem and 
typed in their answers in a blank. Each item was scored correct (1) 
or incorrect (0), with the maximum score of 16 and no partial 
scores. 
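
The scoring rule described above (1 for a correct answer, 0 otherwise, no partial credit) can be sketched as follows. How typed answers were compared with the answer key is not described in the text; exact matching after trimming whitespace is an assumption made here, and the sample items are hypothetical.

```python
def score_test(answers, answer_key):
    """Return the number of correct items (1 point each, no partial credit).

    Assumption: an answer counts as correct only if it matches the key
    exactly after leading and trailing whitespace is removed.
    """
    if len(answers) != len(answer_key):
        raise ValueError("answers and answer_key must have the same length")
    return sum(
        1
        for given, expected in zip(answers, answer_key)
        if given.strip() == expected.strip()
    )

def learning_gain(pre_score, post_score):
    """Raw gain from pretest to posttest."""
    return post_score - pre_score

# Hypothetical three-item excerpt; the actual tests had 16 items each.
key = ["3ax+3ay", "5x", "2"]
pre = score_test(["3ax+y", "5x", "4"], key)      # one item correct
post = score_test(["3ax+3ay", "5x ", "4"], key)  # two items correct
print(pre, post, learning_gain(pre, post))  # 1 2 1
```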

Procedure. We implemented the experiment as regular activ- 
ities in the classroom (using 34 laptop computers) on 2 consecutive 
days, one lesson per day. On Day 1, students were given a brief 
introduction about the lesson and interface and then asked to put 
on headphones. They entered demographic information to log onto 
the lesson. Upon login, they took pretests. Following that, one of 
the four agents (differing in gender and ethnicity) was randomly 
assigned to a student. Students performed the learning task, listen- 
ing to Chris’ overviews and feedback. On Day 2, students were 
assigned to the same agent and performed the learning task in the
same manner. Lastly, they took posttests without Chris. 

Design and analysis. A 2 × 2 factorial design was used, in
which both learner gender and ethnicity had two levels. To analyze
learners’ evaluations of an agent, a two-way analysis of variance
(ANOVA) was conducted. To analyze attitudes and self-efficacy
(each having two subcategories), two two-way multivariate analyses
of covariance (MANCOVAs) were conducted, respectively, with a
pretest set as a covariate to control for the group difference in the
pretest. To analyze learning, a two-way repeated analysis of co-
variance (ANCOVA) was conducted, with a pretest set as a cova-
riate. The significance level was set at α < .05.


Results 


A preliminary analysis of the data was conducted to ensure that 
the assumptions of the parametric statistics were met. Visual 
examination of scatterplots supported the assumption of normality 
and revealed linear relationships. Levene’s test was conducted to 
test the equality of error variance for each ANOVA procedure; 
Box’s test was conducted to test the equality of covariance for each 
MANCOVA procedure. These tests did not reveal any significant 
problems with the equality of error variance and covariance. Table 
1 presents the means and standard deviations for learners’ evalu- 
ations of an agent, mathematics attitudes, and mathematics self- 
efficacy. 

Gendered and ethnicity-based positivity of agent 
evaluations. The two-way ANOVA indicated a significant main 
effect of learner gender, F(1, 116) = 8.22, p = .005, η² = .07. The
females evaluated their agent significantly more positively than
did the males. Also, there was a significant main effect of learner
ethnicity, F(1, 116) = 22.87, p < .001, η² = .17. The Latinos
evaluated their agent significantly more positively than did the
Caucasians. A planned independent-samples t test was further
conducted to compare Latinas with Caucasian males. The results
revealed that the Latinas evaluated their agent significantly more
positively than did the Caucasian males (t = −5.54, p < .001,
d = −1.39).

Gendered inflection of attitudes. The two-way MANCOVA 
revealed a significant main effect of learner gender (Wilks’s Λ =
.95), F(2, 114) = 2.95, p = .046, partial η² = .05. Given the
overall significance, a univariate analysis was further conducted to
examine the contribution of each category of attitudes to the
overall significance. There was a significant main effect of learner
gender on the attitudes specifically toward learning mathematics in
the agent-based lessons, F(1, 115) = 4.82, p = .030, η² = .04. The
females showed significantly more positive attitudes than did the
males. A planned contrast between Latinas and Caucasian males
revealed a similar pattern: the Latinas showed significantly
more positive attitudes than the Caucasian males (t = −2.42, p =
.019, d = −0.6).
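
For readers who wish to relate the reported t statistic to the effect size d, a standard conversion for two independent groups can be sketched as follows. The sketch assumes a pooled-variance t test and takes the group sizes from Table 1 (30 Latinas, 35 Caucasian males); the article does not state how d was computed, so the function name `cohens_d_from_t` and the choice of formula are assumptions made here for illustration.

```python
import math


def cohens_d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d.

    Uses d = t * sqrt(1/n1 + 1/n2), which holds for a pooled-variance
    t test on two independent groups (an assumption about the analysis).
    """
    return t * math.sqrt(1.0 / n1 + 1.0 / n2)


# Attitudes contrast reported in the text: t = -2.42.
# Group sizes are assumed to be those listed in Table 1.
d_attitudes = cohens_d_from_t(-2.42, n1=30, n2=35)
print(round(d_attitudes, 1))
```

Under these assumptions the conversion yields a value of about −0.6 for the attitudes contrast, in line with the effect size reported in the text.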

Gendered enhancement of self-efficacy. The two-way 
MANCOVA revealed neither a main effect nor an interaction effect of
learner gender and learner ethnicity on learners’ mathematics
self-efficacy (p = .783). Nonetheless, because the goal of the learning
environment was to help students build their confidence in math-
ematics learning, we inquired about any group difference in the
improvement of their self-efficacy after the intervention. A two-
way repeated ANOVA was conducted to examine changes in
learner self-efficacy from pretest to posttest. Because the number
of the items in the two tests was not matched, the posttest scores
were statistically converted to match the pretest scores. There was
a significant interaction effect of the within-subject factor (time)
and learner gender, F(1, 116) = 7.47, p = .007, η² = .06. Females
significantly increased their self-efficacy from pretest to posttest,
whereas males did not show the increase. A contrast between
Latinas and Caucasian males revealed a similar interaction pattern,
with only the Latinas showing a significant increase in their self-
efficacy, F(1, 63) = 4.98, p = .029, η² = .07.

Mathematics learning in a socialized environment. The 
analysis of learning included only the 69 students who had
completed algebra posttests on both days. Table 2 presents the
means and standard deviations of the pre- and posttests. The
ANCOVA result revealed neither a main nor an interaction effect
of student gender and ethnicity on learning, F(1, 64) = 3.17, p =
.080, η² = .05. We also conducted a two-way repeated ANOVA
to test the groups’ learning gains over time. There was a significant
main effect of the within-subject factor (time), F(1, 65) = 53.28,
p < .001, η² = .45. A planned contrast between Latinas and
Caucasian males did not reveal a significant difference in their 
learning gains. Overall, regardless of their gender and ethnicity, 
the student groups significantly improved their learning after 
working in the agent-based lessons. 
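
The effect sizes reported with the F ratios in this section can be related to the test statistics through a standard identity for partial eta squared, sketched below. Whether the reported values are partial or classical eta squared is not stated for every test, so treating them as partial eta squared is an assumption, and the function name `partial_eta_squared` is introduced here only for illustration.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F ratio and its degrees of freedom.

    Identity: eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Gender effect on agent evaluations: F(1, 116) = 8.22
eta_gender = partial_eta_squared(8.22, 1, 116)

# Within-subject effect of time on learning: F(1, 65) = 53.28
eta_time = partial_eta_squared(53.28, 1, 65)

print(round(eta_gender, 2), round(eta_time, 2))
```

Under this assumption the two values round to .07 and .45, matching the effect sizes reported for those tests.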

To summarize, the ninth-grade females evaluated their agent
significantly more positively than did males and the Latinos
significantly more positively than did Caucasians. Second, the
females showed significantly more positive attitudes toward the
agent-based learning than did males. Third, the females
significantly increased their mathematics self-efficacy after the
agent-based learning, whereas the males did not show the increase.
These gender differences in evaluations of an agent, attitudes, and
self-efficacy were even more clearly manifested between the Latinas
and the Caucasian males. Lastly, regardless of learner gender and
ethnicity, all the groups significantly increased their learning after
the lessons. Overall, the results suggested that the males and
females have qualitatively different experiences in agent-based
learning. In-depth interviews were conducted to better understand
the nature of the students’ experiences and the agent’s
characteristics that might most appeal to students of this age.


Table 1
Means and Standard Deviations for Posttest Learners’ Evaluations of an Agent, Attitudes, and Self-Efficacy

Learner groups (N = 120)

                               Female                          Male
                               Latina          Caucasian       Latino          Caucasian
                               (n = 30)        (n = 26)        (n = 29)        (n = 35)
Measure                        M       SD      M       SD      M       SD      M       SD
Evaluation of their agent      86.3    4.13    69.23   4.43    TING    4.20    54.57   3.82
Attitudes toward the lessons   9.04    0.50    7.93    0.55    7.56    0.51    TA9     0.45
Self-efficacy in the lessons   20.11   0.80    19.68   0.86    18.46   0.80    18.93   0.75


Table 2
Means and Standard Deviations for Pre- and Posttest Learning Measures

Learner groups (N = 69)

                Female                          Male
                Latina          Caucasian       Latino          Caucasian
                (n = 14)        (n = 18)        (n = 14)        (n = 23)
Measure         M       SD      M       SD      M       SD      M       SD
Learning
  Pretest       6.14    3.80    8.22    2.24    6.43    4.62    7.96    Qe
  Posttest      9.29    4.12    10.78   1.86    7.64    5.20    11.04   2.65


In-Depth Interview 


Method 


Interviewees. The interviews were focused on a deeper un- 
derstanding of the females’ experiences and the clear contrast 
between Latinas’ and Caucasian males’ reactions. Initially, 12 
interviewees were randomly selected from the participant pool that 
had completed both lessons, which resulted in a sample of eight 
Caucasian males, two Caucasian females, and two Latinas. A 
second round of sampling was conducted to obtain a theoretical 
sample (six to eight) from each comparison group, to ensure a 
meaningful, thematic analysis. The sampling targeted the two 
female groups, from which six Caucasian females and four Latinas 
were further selected randomly. 

Procedure. All interviews were conducted individually at the
high schools and followed the lesson implementations. Three 
trained doctoral students conducted the interviews (each taking 
20-30 min), using a loosely structured interview protocol that 
listed a set of main questions, for example, “What did you like or 
dislike about the lessons with a peer-like tutor?” and “What would 
you suggest for the improvement of the lessons?” The protocol 
allowed room for exploration and probing when necessary. To 
ensure the confidentiality of the interviewees, all identifiable com- 
ments were eliminated from the transcripts, and pseudonyms were 
used in the analysis. 

Data analysis. The Constant Comparison method (Charmaz,
2006) was used to identify major differences between the female 
and male groups. To ensure the quality and trustworthiness of the 
findings, the research team analyzed the interview data through a 
collaborative, reiterative process. Both authors read all transcripts 
individually and brainstormed several salient themes. Following 
that, the second author with the help of two graduate assistants 
launched a more systematic, thorough analysis, using the software 
ATLAS.ti (Version 6). Next, they summarized key information about
the interviewees’ experiences with the agent, their classroom ex-
periences, and other critical information, such as the perceptions of
the agent, familiarity with computers, and the level of their atten-
tion to the agent. On the basis of recurring information in the
summary, a list of open codes was developed. These codes were 
appended to relevant quotations in each transcript. A code output 
was generated. The team examined the output carefully to detect 
major patterns across the 22 interviewees and possible consistent 
relationships in the patterns. Lastly, the team elicited three main 
themes. 


Results and Interpretation 


Gendered perspectives and relationship building. The fe- 
males and males demonstrated different views of their agent and 
developed different types of relationships with it while they en- 
gaged in the learning task. The males seemed to treat the agent as 
a mere tool and showed detached attitudes toward it, describing it 
as an “unnatural” or “fake” person and not being able to recall its 
name, gender, or ethnicity. They listed both positive and negative 
aspects of the agent. In most cases, their negative comments were 
longer and more varied than the positive ones. They found the 
agent’s unsolicited explanations rather “annoying” and “boring.” 
About half the males reported that they “turn[ed] off the voice,” 
“skipped,” or “ignored” narratives of their agent that they found 
not helpful. Mark’s comments demonstrated this distant view of 
the agent: 


Interviewer: So how did you find Chris [the agent] similar to a peer 
or friend? 


Mark: I didn’t really think of it as a friend, I just thought of it as like 
a little computer thing. (Interviewer: Oh really?) But yeah. I just don’t 
really think that like computers are supposed to be your friend. 


In contrast, the females seemed to treat their agent as if it were 
a friend or companion, and they built a humanlike, person-to- 
person relationship with their agent. They always called their agent 
by its name “Chris,” used personal pronouns she or he to refer to 
the agent, and paid attention to various aspects of the agent (e.g., 
facial expressions, its hair style, the tone of speech). Not surpris- 
ingly, they reported their experiences with the agent-based learn- 
ing very positively. This phenomenon was far more evident among 
the Latinas. Not one of the Latinas made negative comments about 
the agent; rather, all of them were effusive about their enjoyment 
in working with the agent. Perla (Latina) described her agent as 
being “really nice always” and “just like human thing ... that 
someone is telling you compliments.” Janet (Caucasian) said that 
her agent was “the person next to you [who] would help you with 
whatever problem you need.” By and large, the development of a
humanlike relationship with their agent seemed to generate posi- 
tive effects on their learning process, but some negative conse- 
quences were also observed. Some girls were distressed by the 
agent’s negative feedback. The agent was “not friendly” but 
“mean” and “rude” and made them feel “hurt.” These girls seemed 
to project their interpersonal expectations onto the agent and were 
disappointed when their expectations were not met properly. 

Consequences in learning: Students’ evaluations of agent 
effectiveness. Overall, both male and female students liked the 
agent’s immediate and individualized feedback. However, the 
males said that the explanations were sometimes redundant or, at 
other times, not specific enough. Although they valued the agent’s 
ability to provide feedback and/or to alert them to their mistakes, 
the males often skipped or turned off lengthy explanations to 
directly tackle the problem on their own. Only two males out of 
eight listed “good explanation” as a strength of the agent. The 
males’ complaints were mainly related to weak explanations not 
tailored “for me.” Rick’s comments exemplified the males’ 
reactions: 


The only reason I marked that [evaluated negatively] I didn’t really 
like it ’cause sometimes it explained like too much, like at the
beginning of each section or something. It kept it kind of went on and 
on for me, so it just kind of got annoying for me to have to keep 
listening. 


Conversely, the females spoke highly of the quality and rele- 
vance of the agent’s explanations. Almost all of them said that 
their agent had provided good explanations, which were “clear” 
and “very specific.” Selena said that the agent “explained every 
little part of it,” and “when I would get confused, she would 
explain what I did wrong clearly.” The girls rarely mentioned the 
actions often taken by the boys (e.g., skipping lengthy explana- 
tions). As a result, the girls were more likely to attend to and 
benefit from the coordinated instructional features (e.g., voice 
narration with the accompanying texts on the screen). Abby’s 
positive view consistently appeared in almost all females’ com- 
ments, “I would really enjoy it because like it explained it how to 
do it and it had visuals of how to do it, and it would explain how 
to go step by step. So it would be really helpful.” Presumably, the 
companionship that the females had built with the agent estab- 
lished a positive context for the subsequent learning process and 
made the females willing to listen to even lengthy explanations. 

Connection between real and virtual contexts in learning 
experience. The gendered pattern of learning experiences with 
the agent did not seem to occur in a social vacuum. Rather, 
students’ views of their agent and the quality of their learning 
experiences with it seemed to be influenced by their everyday 
classroom experiences. The students who felt less supported in the 
classroom tended to develop positive attitudes toward the agent- 
based learning and reported positive learning experiences. The 
Caucasian males rarely expressed psychological stress in their 
classrooms; only two males showed a glimpse of social discon- 
nection from their teacher. In contrast, all the females mentioned, 
at least once, a negative experience and/or feelings of insecurity in 
their classrooms. Most of them stated that their teacher did not care 
about their learning and was not willing to help them when they 
faced a difficulty. 
This phenomenon was more manifested among Latinas.
Whereas three Caucasian girls out of eight mentioned some pos- 
itive aspect of the classroom, the Latinas’ narratives presented a 
greater disconnection in their relationships with the teacher and 
even a sense of fear and intimidation. Daniela (Latina) expressed 
discomfort with her teacher’s tone of voice: “They [teachers] teach 
you but sometimes you don’t get the thing and they teach again but 
in different voice.” In response to the question about the difference 
between the agent Chris and the teacher, Selena contrasted “upbeat 
and friendly” Chris with her “kind of intimidating” teacher. Al- 
though none of the Latinas showed difficulty with conversing in 
English, many indicated that the ordinary classroom instruction 
was “too fast,” and the teacher was “leaving you alone” even when 
students did not grasp the concepts. They felt relieved working 
with their agent, who would never blame them for not catching up 
to its speed. Some even argued, “You can learn more, and they 
[agents] teach you more, better than the teacher.” 

To summarize, the interviews revealed that the males and fe- 
males developed different relationship patterns with their agent. 
This gendered pattern of relationship building resulted in their 
differential evaluations of the quality and effectiveness of the 
agent’s feedback and explanations. Also, the students’ experiences 
with the agent were closely related to their everyday classroom 
experiences. The females, psychologically marginalized in the 
classroom, perceived the agent as a genuine companion who 
kindly helped them learn step by step. 


General Discussion 


This study was grounded in two theoretical premises. First, 
inequity issues in STEM education are attributable to the unsup-
portive context in STEM classrooms for traditionally underrepre-
sented groups of students. To address this issue, educators should
contrive supportive learning contexts, in which these students feel 
cared for and encouraged to engage in STEM learning. These 
contexts should accommodate the learning styles of the students 
who favor multifaceted interactions and social relations. Second, 
an embodied agent, with its social and empathetic capabilities, 
might afford humanlike interactions with the students. If designed 
carefully, agent-based learning could create a socially rich and 
inclusive context for those groups of students and, thereby, support 
their positive learning experiences and sustained intellectual pur- 
suit in STEM. On the whole, the results from both phases of this 
study support the premises and show agent technology to be a 
promising tool in the resolution of urgent educational issues. 
The results also argue for the expansion of advanced learning 
technology. 

Consistent with the present literature, the classroom experiment 
revealed clear gender differences in responses to agent-based 
learning. Females’ preference for social interactions and relation- 
ship building in the classroom seemed to be reflected consistently 
in their evaluations of their agent. Females rated the agent with an 
average of 4.6 on the 7-point scale, and males with an average of 
3.9. In particular, Latinas rated the agent with an average of 5.1, 
and Caucasian males with an average of 3.2. The females’ favor- 
able evaluations of their agent seemed to lead them to build more 
positive attitudes toward learning from the agent and to increase 
their self-efficacy after the lessons. Moreover, the females signif-
icantly increased their mathematics learning, to a degree comparable
to their male counterparts, after working with the agent. At a minimum, the
conventional achievement gap favoring Caucasian males was not 
observed in the agent-based lessons. When the females realized 
that they had instructional support and were free from social 
embarrassment, they were more likely to engage and not be afraid 
of making mistakes, as indicated in the interviews. 

The interviews supported the quantitative results of the class- 
room experiment and illuminated the nature of gender differences 
in the responses to the agent-based learning. First, both Latinas and 
Caucasian females engaged themselves in interactions with the 
agent and responded socially. Although the females admitted their 
agent to be a computer program, their interactions with it resem- 
bled their everyday social interactions with a friend in many ways. 
This implies that quality relationships must be important for fe- 
males’ mathematics learning even in technology-based environ- 
ments. Regardless of a virtual or real space, relationship building 
is an essential and natural part of many females’ learning process 
as explained by feminist scholars (Belenky et al., 1997; Noddings, 
2003). The development of companionship with their agent pro- 
vided the females with some advantages. It effectively engaged 
them in the task and let them be patient throughout the lessons. 
Also, it reduced the chance of experiencing the negative emotions 
that many females had in ordinary mathematics classrooms. 

The study revealed that a persistent cultural and social discon- 
nection existed between the females (more with Latinas) and their 
teachers. The students acknowledged that their teachers were over- 
burdened with teaching a big class. Still, their feelings of discon- 
nection were a challenge to their engagement and success in school 
mathematics (Lim, 2008). The features that they listed as support- 
ive of their learning during the lessons were similar to the char- 
acteristics of culturally relevant pedagogy (Gay, 2000). The Lati- 
nas earnestly expressed their need for a psychologically “safe” 
space, where they could ask for help freely as many times as 
needed. The provision of a communal sense of learning—working 
together closely with someone willing to help—was a strength of 
the agent-based learning. Their feelings of connection to the agent 
and the agent’s social encouragement seemed to lead them to full 
engagement in the task (Sciarra & Seirup, 2008). 

The study also confirmed the trends in human–computer inter-
action and further extended our understanding in the area. The
more computers present humanlike characteristics, the more likely 
they are to elicit social behavior from users (Lee, Jung, Kim, & 
Kim, 2006). Likewise, the agent Chris, looking peerlike, success- 
fully elicited the females’ social responses. Once the females 
identified their agent as a helper for their learning, it did not matter 
whether the helper was real or artificial (Turkle, 2011). More 
importantly, the study revealed that the boundary between real and 
virtual spaces was blurred. Students’ online learning experience, 
either positive or negative, could be better understood in relation to 
their everyday classroom experience. Their learning experiences in 
the two spaces are closely interrelated, each providing an impor- 
tant context for the other and each influencing the other. In similar 
fashion, the females’ (particularly the Latinas’) positive experi- 
ences with the agent-based learning were influenced largely by 
their marginalized experiences in the everyday mathematics class- 
rooms. An implication for the designers of advanced learning 
technology is that the careful observation and accurate understand-
ing of challenges that students face in the classroom might be a
primary step in designing effective technology-based learning en-
vironments.

Several previous studies on embodied agents have reported that
learners perceived an agent matched with their own gender or
ethnicity more positively than a mismatched one (e.g., Kim & Wei,
2011; Moreno & Flowerday, 2006), indicating that social biases in
the real world were consistently applied to agent–learner relations.
However, our interest was in examining the potential of agent 
technology for countering existing stereotypes and biases. We 
focused on the motivational and persuasive role that a peerlike 
agent would play, regardless of its gender and ethnicity. Neither 
agent gender nor ethnicity was examined as a factor; instead, four 
versions of an agent differing in gender and ethnicity were ran- 
domly assigned to students, to control for a confounding effect by 
the learners’ biased perceptions. In the interviews, the Latinas who 
worked with a Latino agent tended to express a higher level of 
affection for the agent than the Latinas who worked with a Cau- 
casian agent. Nonetheless, all the Latinas agreed that their agent, 
either Caucasian or Latino, was a great helper. 

Recently, there has been a growing awareness about the social and 
cultural aspect of females’ and ethnic minorities’ learning processes 
(Carr & Steele, 2009; Nasir et al., 2006). It is clear that more research 
is called for in designing effective learning technology for these 
students. This technology needs to support their identification with 
STEM topics and to include specific features that stimulate motiva- 
tion. Technology-use trends in the United States show that African 
American and English-speaking Latino youths use Internet and mo- 
bile data more frequently than do Caucasian youths (Smith, 2010); 
thus, games and mobile technology could be a functional space for 
inviting these students’ attention to STEM topics. 

Lastly, the study had a few limitations. First, the agent in the 
study was designed to be a whole entity with instructional, social, 
affective, and aesthetic attributes. The interactions among these 
attributes and their relative contributions to the females’ positive 
experiences should be clarified in the subsequent research. Second, 
both quantitative and qualitative data were collected in specific 
locations and with 2-day implementations. The findings should be 
generalized judiciously. Third, based on the results of the experi- 
ment, the interviews were focused on the contrast between the 
females and Caucasian males. Much is unknown in regard to 
Latino males’ interactions with their agent and their experiences in 
the learning environment. Future research is warranted to over- 
come the limitations and confirm the findings of the present study. 
Appendix


Agent Messages Examples 


1. The agent presented persuasive (P) and informational 
(I) messages in the introductions to new sections and 
subsections. 

At the start of the section on Distributing to Combine Like 
Terms, the agent said: 


Hey, we are doing great. You know, if we do well in math, we can
major in anything we want in college because many jobs require an
understanding of math. Developing math skills now will give us
more opportunities later (P). Alright, in this section, we are learn-
ing how to distribute first and then combine like terms. Terms
don’t always come combined ... (I).


2. The agent presented motivational and informational feed- 
back in a sequence while students solved problems. 
Question 5: Simplify the expression by distributing and combining
the like terms.
4r + 4(r + s) =

Possible answers typed    Type of errors
8r + 4s                   None
4r + 4r + 4s              Like terms not combined
8r + s                    Partial distribution
Any other (1st try)       Random
Any other (2nd try)       Random


Agent messages

Motivational feedback

Excellent. Let’s keep up the good work.

Everybody makes a mistake. Let’s learn from the mistakes we’ve made.

It was a challenge, but hang on there. It will pay off in the end.

Informational feedback

There is one more step after distributing. Let’s check for like terms
and combine them.

Distribute the 4 to both terms in the parentheses and then combine like
terms.

First, distribute the 4 to both r and s in the parentheses, and then
look for like terms.

Distribute the 4; that means we now have 4r plus 4r plus 4s. The first
two terms are alike, so we combine them and the answer is 8r plus 4s.
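
The feedback sequence for Question 5 can be read as a small classification rule: a typed answer is matched to an error type, and the error type (together with the attempt number for unrecognized answers) determines the informational message. The Python sketch below illustrates that reading under stated assumptions. The names `ERROR_TYPES`, `INFORMATIONAL_FEEDBACK`, `classify`, and `informational_feedback` are hypothetical, the whitespace-insensitive matching is an assumption, and the pairing of each informational message with an error type follows the order in which the messages are listed above rather than any published specification of the system.

```python
# Recognized answers for Question 5, 4r + 4(r + s), with spaces removed.
ERROR_TYPES = {
    "8r+4s": "none",
    "4r+4r+4s": "like_terms_not_combined",
    "8r+s": "partial_distribution",
}

# Informational messages, paired with error types in listed order (assumed).
INFORMATIONAL_FEEDBACK = {
    "like_terms_not_combined": (
        "There is one more step after distributing. "
        "Let's check for like terms and combine them."
    ),
    "partial_distribution": (
        "Distribute the 4 to both terms in the parentheses "
        "and then combine like terms."
    ),
    "random_first_try": (
        "First, distribute the 4 to both r and s in the parentheses, "
        "and then look for like terms."
    ),
    "random_second_try": (
        "Distribute the 4; that means we now have 4r plus 4r plus 4s. "
        "The first two terms are alike, so we combine them "
        "and the answer is 8r plus 4s."
    ),
}


def classify(answer):
    """Map a typed answer to an error type; anything unrecognized is 'random'."""
    key = answer.replace(" ", "").lower()
    return ERROR_TYPES.get(key, "random")


def informational_feedback(answer, attempt):
    """Return the informational message, or None when the answer is correct."""
    error = classify(answer)
    if error == "none":
        return None
    if error == "random":
        # Unrecognized answers get a hint first, then the worked solution.
        error = "random_first_try" if attempt == 1 else "random_second_try"
    return INFORMATIONAL_FEEDBACK[error]
```

For example, `informational_feedback("8r + s", attempt=1)` returns the partial-distribution hint, while an unrecognized answer on the second attempt returns the fully worked solution. Motivational messages are left out of the sketch because their pairing with error types cannot be recovered with certainty from the table.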
This study evaluated whether the Targeted Reading Intervention (TRI), a classroom teacher professional 
development program delivered through webcam technology literacy coaching, could provide rural classroom 
teachers with the instructional skills to help struggling readers progress rapidly in early reading. Fifteen rural 
schools were randomly assigned to the experimental or control condition. Five struggling readers and 5 
non-struggling readers were randomly selected from eligible children in each classroom. There were 75 
classrooms and 631 children in the study. Teachers in experimental schools used the TRI in one-on-one 
sessions with 1 struggling reader in the regular classroom for 15 min a day until that struggler made rapid 
reading progress. Teachers then moved on to another struggling reader until all 5 struggling readers in the class 
received the TRI during the year. Biweekly webcam coaching sessions between the coach and teacher allowed 
the coach to see and hear the teacher as she instructed a struggling reader in a TRI session, and the teacher 
and child could see and hear the coach. In this way the classroom teacher was able to receive real-time 
feedback from the coach. Three-level hierarchical linear models suggested that struggling readers in the 
intervention schools significantly outperformed the struggling readers in the control schools, with effect sizes 
from .36 to .63 on 4 individualized achievement tests. Results suggested that struggling readers were gaining 
at the same rate as the non-struggling readers, but they were not catching up with their non-struggling peers. 


Keywords: individualized instruction, literacy coaching, educational technology, rural classroom teacher,
struggling readers


American schools have come under increasing scrutiny, largely 
because many children are not acquiring the skills they need to 
succeed in the larger culture (Grissmer, Flanagan, Kawata, & 
Williamson, 2000). The National Center for Education Statistics 
(2009) has reported that two thirds of fourth graders are not able to 
comprehend difficult texts, and 63% of fourth graders are reading 
only at a very minimal level of proficiency. Among children from
families in poverty, only 28% are reading at this minimum
level of proficiency in fourth grade (Haager, Klingner, & Vaughn,
2007; Lyon, 2001). These low levels of reading proficiency are 
especially true for rural children from low-wealth communities 
who come to school with lower readiness skills than other children 
(Lee & Burkham, 2002). These lower readiness skills are due in 
part to the proportionately greater child poverty rates in rural 
versus urban areas with the gap between rural and urban poverty 
growing over the last 10 years (O’Hare, 2009). Since poverty is the
most potent predictor of school success, even greater than maternal
education, two-parent families, and a host of other demographic
variables (Brooks-Gunn & Duncan, 1997), it is important to un-
derstand the context of schooling in these low-wealth rural com-
munities as well as develop and evaluate school programs that may
be effective for children in the context of poverty. 

The higher child poverty rate in rural communities impacts 
schooling, with a poorer tax base for schools, lower teacher pay, 
less educated teachers, and less access to educational resources 
(Amendum, Vernon-Feagans, & Ginsberg, 2011; Vernon-Feagans 
et al., 2012; Provasnik et al., 2007). When trying to improve 
student achievement, rural schools face challenges of geographic 
isolation and low population density that often lead to less ready 
access to state-of-the-art professional development for teachers 
coupled with less access to technology in the classroom (Dewees, 
2000; Vernon-Feagans, Gallagher, & Kainz, 2010). The issues 
faced by rural schools were underscored by a Government Ac- 
countability Office (2004) report that sampled rural school prin- 
cipals. The report highlighted rural school needs for better tech- 
nology and teacher professional development. 

The professional development program for classroom teachers 
evaluated in this article tries to address some of the needs of
rural low-wealth schools with respect to both technology
and teacher professional development. The Targeted Reading In- 
tervention (TRI) provides teachers with professional development 
for struggling readers through state-of-the-art webcam coaching 
that allows literacy coaches thousands of miles away to provide 
real time feedback to teachers in their classrooms as the teachers 
instruct struggling readers. The program also provides extensive 
website materials for instruction, webcam workshops and webcam 
team/grade level meetings, as well as e-mail correspondence be- 
tween teacher and coach. 


Technology and Early Reading 


Most of the previous research on the use of technology for early 
reading has focused on computer assisted instruction (CAI) devel- 
oped for use by children who need or want additional instruction 
and practice in reading. This technology allows students to work 
on their own to supplement regular instruction in the classroom, 
minimizes teacher involvement, and has been shown to be effec- 
tive in improving the early reading skills of children, including 
children with different skill levels and different ethnic and socio- 
economic backgrounds (Blok, Oostdam, Otter, & Overmaat, 
2002). 

Recently, there has been particular emphasis on developing and 
examining CAI for children at risk for early reading disability 
(Chambers et al., 2011; Huffstetter, King, Onwuegbuzie, Sch- 
neider, & Powell-Smith, 2010; Saine, Lerkkanen, Ahonen, Tolva- 
nen, & Lyytinen, 2011; Torgesen et al., 1999). These studies have 
demonstrated that children at risk for reading problems can prog- 
ress in basic reading skills through CAI delivered by trained and 
specialized teachers/tutors in the resource room setting or in a 
mobile computer lab. One study demonstrated that an extended 
day program that used a web-based instructional framework was 
more effective than direct instruction delivered by a specialized 
teacher for children with significant reading delays in elementary 
school (Cole & Hilliard, 2006). Chambers et al. (2011) demon- 
strated that schools that used tutor-led small group instruction with 
reading software could significantly improve the reading of
struggling students in comparison to schools that did not use this 
tutor and software. Although these studies were important in 
underscoring the value of CAI for young at risk readers, they were 
likely costly if sustained because of the need for a specialized 
trainer or teacher who assisted the children during CAI. 

Little research has focused on using technology to help the 
classroom teacher become more effective in instructing struggling 
readers except to introduce teachers to ancillary software that can 
supplement instruction in the classroom. A survey of elementary 
school teachers suggested that teachers used technology as a 
supplemental tool for instruction but did not use technology as the 
central tool for instruction (Franklin, 2007). Thus, studies of technology
use by classroom teachers have assessed the effectiveness
of ancillary software packages for improving reading with mixed
results as to the efficacy of such software for struggling readers. 
For instance, Lewandowski, Begeny, and Rogers (2006) found that 
at-risk elementary school readers practicing alone did not improve 
fluency, whereas both tutor- and computer-assisted groups of 
children significantly improved in reading speed and accuracy. 
Struggling readers who received training via computer performed 
as well as students who received individualized tutoring. On the 
other hand, Mathes, Torgesen, and Allor (2001) found that al- 
though the Peer Assisted Learning Strategies (PALS) reading 
program improved student reading, the addition of a phonological 
awareness computer software for struggling readers did not sig- 
nificantly improve reading over the traditional PALS program. 
Again, these computer programs probably saved time for the 
classroom teacher since the teachers did not have to be as involved 
in student learning but may have failed to help improve classroom 
teacher instructional literacy practices. 

Some recent studies have focused on using technology to im- 
prove the teaching of preschool classroom teachers who are in 
Head Start or in pre-kindergarten programs for children from at 
risk backgrounds. These studies have used video and web based 
video platforms to promote effective professional development in 
literacy for these preschool teachers. In a series of studies examining
the effectiveness of MyTeachingPartner, teachers were
asked to videotape themselves and then send the DVDs to a research
team who in turn would give the teachers feedback on their 
classroom literacy practices in a few weeks or a month. Teachers 
also had access to a website for information on the program. This 
kind of professional development technology has proven effective 
for preschool teachers who serve a diverse group of learners 
(Mashburn, Downer, Hamre, Justice, & Pianta, 2010; Pianta, 
Mashburn, Downer, Hamre, & Justice, 2008). Especially interest- 
ing for the current study was Mashburn et al.’s (2010) study that 
compared two delivery systems to preschool teachers in a random- 
ized control trial. Preschool teachers in the first condition had 
access to a literacy video library via a highly developed website 
with instructional materials for the teachers to easily access. The 
second condition allowed teachers access to the literacy video 
library but also allowed teachers to view their own teaching video 
clips on the website with reflective questions about their instruc- 
tion. In addition, this group also participated occasionally in video 
conferencing with a literacy/language coach to discuss teaching 
practices. Mashburn et al. found that preschool classroom teachers 
who had both access to the video library and occasional
coaching via the website and videoconferencing improved their 
children’s vocabulary skills more than teachers who only had 
access to the video library. Another important recent study used a 
randomized control trial to examine the effectiveness of a literacy/ 
language professional development program for preschool teach- 
ers called Classroom Links to Early Literacy under two different 
conditions: live literacy coaching of teachers versus video coach- 
ing (Powell, Diamond, Burchinal, & Koehler, 2010). In this case, 
both face to face and video coaching involved observing the 
teacher for 90 min every 2 weeks and giving her oral and written 
feedback in addition to written feedback on the videotaping of 
herself during teaching. This one semester study found that in both 
conditions, children who used their program had large gains in all 
areas of literacy and that there were no differences between the 
live and video conditions. This latter study suggests, as other
research without technology has found, that professional development
with the addition of coaching may be the most effective way
to improve the instruction of classroom teachers, especially in 
literacy. 

A series of studies by Connor and colleagues (Al Otaiba et al., 
2011; Connor et al., 2011; Connor, Morrison, & Petrella, 2004; 
Connor et al., 2009) have used technology in early elementary 
school in a different way to help classroom teachers. They have 
used software that individualizes instruction in reading for children 
and have used literacy coaches that help the classroom teacher use 
the software effectively in individualizing instruction. Their
Instruction × Skill interaction studies have been very innovative and
have certainly shown that their software in conjunction with live 
literacy coaching in the classroom is very effective in helping all 
children progress in early reading. 

These innovative technologies for delivering professional devel- 
opment to teachers in the form of a website that contained teacher 
videos of themselves teaching with later feedback and using inno- 
vative software to individualize instruction, although important, 
may be limited because classroom teachers were not able to 
receive immediate real-time feedback about their teaching prac- 
tices with individual children that may be only accomplished 
through real time coaching of classroom teachers (Carlisle & 
Berebitsky, 2011; Elish-Piper & L’Allier, 2011; McGill-Franzen,
Allington, Yokoi, & Brooks, 1999). In addition, these programs 
did not help teachers directly with their instruction of struggling 
students but rather focused on improving effective instruction for 
all children in the class. 


Coaching, Early Reading, and the Targeted Reading 
Intervention 


This recent research using technology supports the previous 
work on the importance of literacy coaching as a way to scaffold 
the skills of classroom teachers to make changes in classroom 
instruction. Research over the last 10 years has suggested that the 
most effective way to promote better teaching of reading by 
classroom teachers is by developing professional development 
programs that include the addition of ongoing support of teachers 
through the effective use of literacy coaches. Professional organi- 
zations like the International Reading Association (2004) and other 
research on coaching (Elish-Piper & L’Allier, 2011; McGill- 
Franzen et al., 1999; McKenna & Walpole, 2008) have demon- 
strated that materials and workshops alone are not enough to 
improve literacy instruction for classroom teachers, but the addi- 
tion of having a literacy coach for the classroom teacher can 
improve teacher reading practices that are linked to improved 
student outcomes. For instance, Carlisle and Berebitsky (2011), in 
a quasi-experimental study of Reading First, found that profes- 
sional development workshops alone were not as effective in 
improving first grade student decoding skills as professional work- 
shops that also included literacy coaching over the school year. 
Coaches in this study visited classrooms and worked one-on-one 
with teachers to give feedback on their teaching, modeled methods 
of instruction, and served as a literacy resource. This real-time
feedback for classroom teachers was not available using previous
technology (Mashburn et al., 2010; Pianta et al., 2008; Powell et
al., 2010) because teachers’ videos of themselves teaching require
extended time by literacy coaches/consultants to observe the videos
and provide feedback to teachers. Furthermore, previous video
coaching has had coaches watch the teacher for prolonged periods
of time, after which the teacher received delayed feedback on literacy
instructional practices.

As Kennedy and Deshler (2010) have recommended, technol- 
ogy should be used only when it fits appropriately within the 
theory of change and enhances the underlying mechanisms of 
professional development in a positive way to maximize the 
possibility that teachers become more effective teachers for 
struggling readers. The delivery system of the Targeted Read- 
ing Intervention used a professional development program that 
targeted struggling readers. The TRI included the use of literacy 
coaching weekly through webcam technology that allowed the 
immediate feedback that is accomplished through live one-on- 
one literacy coaching. That is, coaches thousands of miles away 
were able to see and hear the classroom teacher as she provided 
reading instruction to a struggling reader and the coach could 
give the teacher real time feedback on practices as well as 
problem solve about the best strategies to use with a particular 
struggling reader. This webcam approach to help classroom 
teachers instruct their struggling readers could help avoid the 
need for a specialized teacher to implement remedial reading 
programs (Amendum et al., 2011). Using webcam technology 
may also be more cost effective and feasible in rural areas 
where geographic isolation may prevent access to high quality 
professional development (Vernon-Feagans et al., 2012; Pro- 
vasnik et al., 2007). 

Thus, the intervention described in this study (The Targeted 
Reading Intervention) was developed in order to provide class- 
room teachers with particularly effective reading strategies for 
struggling readers in early elementary school. These strategies 
were implemented with the help of a literacy coach who worked 
with the teacher so she learned the strategies in instructional 
one-on-one diagnostic teaching sessions so the teacher could see 
the progress of individual struggling readers. Technology was used 
that allowed the literacy coaches to see and hear the teachers in 
these one-on-one sessions and give real time feedback to maximize 
teacher instructional change. Within our more elaborated model of 
teacher change, we included coaches who scaffolded the experi- 
ence of teachers as the teacher worked individually with one 
struggling reader in the hope of changing the way the teacher 
delivered instruction to struggling readers. Thus, teacher experi- 
ence of being coached and working with one child at a time has 
been hypothesized to be one mechanism for improving effective 
teacher instruction (Morgan, Timmons, & Shaheen, 2006; Risko et 
al., 2008). 


Summary 


The TRI has a number of unique elements that together may 
create the most effective instruction for struggling readers within 
the regular classroom setting. First, unlike many other interven- 
tions, the TRI uses the classroom teacher to deliver the interven- 
tion to each individual struggling reader through efficient, diag- 
nostic one-on-one instructional sessions. Second, the TRI iterative 
process of the teacher working with one struggling reader at a time 
helps the teacher understand and experience the success as she sees 
the struggling reader make rapid gains. Third, the TRI uses an 
innovative, web-based, collaborative coaching model. Biweekly,
each TRI teacher uses a laptop computer with a webcam in her
classroom so that she can see and hear her literacy coach and the 
coach can see and hear her working with an individual struggling 
reader. Real time feedback and problem solving can be employed 
during these live sessions for individual children. 

In this study, we sought to examine whether the TRI could 
accelerate struggling readers’ early literacy skills so that they not 
only made significant progress across a year but that they began to 
catch up to their non-struggling classroom peers. This required
examination of struggling and non-struggling readers’ perfor- 
mance in the intervention schools relative to each other and to 
struggling and non-struggling readers in the control schools. The 
research questions were the following: (1) Do struggling readers 
who participate in TRI demonstrate better performance on tests of 
early literacy at the end of a school year than struggling readers 
who do not participate in TRI? (2) When compared to struggling
readers in control schools and to non-struggling classroom peers, 
does the spring performance of struggling readers in the interven- 
tion schools indicate that they are catching up to their non- 
struggling classroom peers? 


Method 


Setting 


Sixteen rural schools from five poor rural counties in different 
regions of the United States participated in the study, including 
schools in Texas, New Mexico, Nebraska, and North Carolina. All 
kindergarten and first grade classrooms in each school partici- 
pated. Schools within each school district were pair matched on the 
following: percentage of free and reduced lunch, school size, 
percentage of minorities, and participation in Reading First. One 
member of each pair was randomly selected to be the experimental 
school. Difficulties with accessing the Internet led to the with- 
drawal of one small experimental school that contained one kin- 
dergarten and one first grade classroom. The 15 remaining partic- 
ipating schools included 75 kindergarten and first grade 
classrooms and 631 students. All schools received Title I funding. 
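
The matched-pair assignment described above can be illustrated with a minimal sketch. The school labels, the seed, and the function name `assign_pairs` are hypothetical, and the sketch assumes eight pairs, which is what 16 pair-matched schools would imply; it is not the procedure the research team actually ran.

```python
import random

def assign_pairs(pairs, seed=None):
    """Randomly pick one school of each matched pair as the experimental school."""
    rng = random.Random(seed)
    assignment = {}
    for school_a, school_b in pairs:
        experimental = rng.choice((school_a, school_b))
        control = school_b if experimental == school_a else school_a
        assignment[experimental] = "experimental"
        assignment[control] = "control"
    return assignment

# Hypothetical labels: 16 schools already matched into 8 pairs on the
# school-level characteristics listed in the text.
pairs = [(f"school_{2 * i + 1}", f"school_{2 * i + 2}") for i in range(8)]
assignment = assign_pairs(pairs, seed=0)
```

Randomizing within pairs, rather than across all schools at once, keeps the experimental and control groups balanced on the matching variables by construction.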


Participants 


The demographics of the 631 children who participated in the 
fall assessments are described in Table 1. Since all schools were in 
low-wealth counties, the reported maternal education of these 
children was generally just beyond high school. Approximately 
50% of the children were from minority backgrounds, and half 
were boys. Teacher demographics are shown in Table 2 and are 
consistent with literature on rural schools. Teachers had more 
years of experience than reported for urban teachers, with an 
average of 15 years of teaching experience (Lee & Burkham, 
2002). 

Within each experimental and control classroom, teachers iden- 
tified children who were struggling and non-struggling readers 
with the help of the TRI literacy coach, mandated state assessment 
data and classroom performance within 2 months of the beginning 
of the school year. Teachers then rated all the children in the class 
as to whether they were profiting from regular classroom instruc- 
tion in reading and were on grade level. Based on this information, 
five struggling readers were randomly selected from those children
rated as significantly below grade level in reading, and five
non-struggling children were randomly selected from those children
who were rated as on or above grade level in each classroom. 
Because of a variety of permission and attrition factors, there were 
approximately nine children who participated in each classroom. 
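
As an illustration only, the within-classroom selection could be sketched as follows. The rating labels (`"below"`, `"on_or_above"`), the child identifiers, and the function name are assumptions made for the example and do not come from the study materials.

```python
import random

def sample_classroom(ratings, per_group=5, seed=None):
    """Randomly select up to `per_group` children from each teacher-rating group."""
    rng = random.Random(seed)
    below = sorted(c for c, r in ratings.items() if r == "below")
    on_or_above = sorted(c for c, r in ratings.items() if r == "on_or_above")
    return {
        "struggling": rng.sample(below, min(per_group, len(below))),
        "non_struggling": rng.sample(on_or_above, min(per_group, len(on_or_above))),
    }

# Hypothetical classroom of 20 children: 7 rated below grade level, 13 on or above.
ratings = {f"child_{i}": ("below" if i < 7 else "on_or_above") for i in range(20)}
selected = sample_classroom(ratings, seed=1)
```

The `min(...)` guard covers classrooms with fewer than five children in a group, one way a classroom could end up with fewer than ten participants.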

We defined four groups of children for analysis purposes and to
test hypotheses about the effectiveness of the TRI: struggling
readers in experimental intervention schools (SRI), non-struggling
readers in the same experimental intervention schools (NSI), struggling
readers in non-intervention control schools (SRC), and non-struggling
students in the same non-intervention control schools
(NSC). In experimental intervention schools, there were 192 struggling
readers (SRI) and 203 non-struggling readers (NSI). In the
control schools, there were 107 struggling readers (SRC) and 129
non-struggling readers (NSC).
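
A quick arithmetic check, using the group labels from the Table 1 note (NSC for non-struggling readers in control schools), confirms that the four group sizes reported here sum to the 631 children assessed in the fall.

```python
# Group sizes as reported in the text; labels follow the Table 1 note.
groups = {"SRI": 192, "NSI": 203, "SRC": 107, "NSC": 129}

intervention = groups["SRI"] + groups["NSI"]  # 395 children in intervention schools
control = groups["SRC"] + groups["NSC"]       # 236 children in control schools
total = intervention + control                # 631 children overall
```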


The Targeted Reading Intervention Using Webcam 
Coaching 


The main objective of the overall Targeted Reading Intervention 
(TRI) was to help the classroom teacher acquire key reading 
diagnostic strategies to promote rapid reading gains in K-1 struggling
readers through a technology-driven professional development
program that included ongoing biweekly coaching from a
literacy consultant. The coaches all had extensive experience as 
teachers and/or reading coaches in early elementary school. Most 
were doctoral students in the School of Education. These coaches 
went through an intensive training that included videotaping them- 
selves working with individual children and receiving feedback 
from the intervention director of the project. Finally, coaches were 
given feedback throughout the academic year with respect to the 
challenges of working with teachers who were not always moti- 
vated to implement the TRI, coincident with current literature on 
coaching (Al Otaiba, Hosp, Smartt, & Dole, 2008). 

All teachers in the experimental group received a 3-day summer 
workshop to learn the TRI strategies and to practice them. The 
intervention director and the trained reading coaches led the 3-day 
institute. During the year, a literacy coach used cost effective 
webcam technology to meet with the teacher for about 20 min 
every 2 weeks over the instruction of an individual struggling 
reader. When the student made rapid progress, the student was 
transitioned to a small group, and another child was chosen to 
work one-on-one with the teacher. Through this webcam technol- 
ogy, the literacy coaches could help the classroom teacher use the 
TRI strategies effectively with each struggling reader in real time, 
help decide when a student was ready to be transferred to a small 
group session, and problem solve about students who were not 
making rapid progress. In addition, the literacy coach also met 
with each school team for 30 min bi-weekly through webcam 
technology to further reinforce the strategies and problem solve 
about individual children. Finally, workshops were also provided 
to the teachers every few months via webcam to support their 
developing understanding of the TRI process, models, and strate- 
gies. The TRI protected website contained all the training videos, 
instructions, and manuals that could be downloaded by teachers 
and links to downloadable books and so forth. 

During the school year, the teachers implemented the TRI in 
15-min one-on-one sessions with a struggling reader that included 
the following three parts each day:

Table 1
Child Demographics and Achievement Scores

                                Kindergarten                        First grade
Variable             Statistic  NSC      SRC      NSI      SRI      NSC      SRC      NSI      SRI
Male                 N          59       57       94       90       70       50       109      102
                     %          0.59     0.63     0.55     0.54     0.50     0.64     033      0.54
White                N          59       57       94       90       70       50       109      102
                     %          0.59     0.39     0.52     0.54     0.57     0.40     0.47     0.45
Maternal education   N          56       53       91       81       67       47       101      95
                     M          14.18    12.64    13.63    13.06    13.67    13515    13.52    12.99
                     SD         (Oe      225)     DAS      3S       2.16     2.03     2.40     OE
Fall PPVT-III        N          53       54       94       90       67       48       109      102
                     M          102.81   94.09    98.67    91.50    98.79    88.54    98.14    91.18
                     SD         13.44    14.91    12.92    14.98    14.71    15.43    13.16    12.76
Fall WA score        N          58       7        94       89       69       50       107      102
                     M          426.41   411.12   431.16   409.67   468.06   450.96   470.28   455.02
                     SD         18.66    18.87    22.01    23.11    1921)    18.99    17.03    19.76
Fall LW score        N          59       ol       94       90       68       50       109      102
                     M          S729     353.88   376.90   354.60   424.90   401.30   431.06   406.10
                     SD         20.95    20.36    22.45    22.45    27.24    19.02    25.63    ysl
Fall PC score        N          ao)      Dil;     94       90       69       50       109      102
                     M          411.88   403.54   409.29   402.66   449.48   431.24   454.93   431.38
                     SD         LES      13.31    18.98    13.60    26.54    17.34    20.85    20.19
Fall SS score        N          Be       57       89       88       70       50       109      102
                     M          464.76   445.61   468.22   448.78   490.50   483.50   491.36   484.65
                     SD         18.72    18.67    14.87    17.61    8.32     11.81    7.54     one
Spring PPVT          N          55       Sil      88       84       69       45       103      oy
                     M          105.15   96.63    100.50   95.62    103.29   94.38    100.51   91.67
                     SD         17.12    12.68    13.85    11.50    Aa)      15.42    15.44    14.61
Spring WA score      N          55       51       88       84       69       45       103      92
                     M          460.38   449.10   465.24   456.81   482.64   466.73   484.93   474.21
                     SD         17.18    20.83    17ESS    21.41    SES)     17.02    19.05    16.77
Spring LW score      N          a5       51       88       84       68       42       103      OD
                     M          408.78   389.71   418.39   403.04   458.49   434.40   463.65   442.90
                     SD         21.09    16.21    22.70    DES 8,   23571    20.63    22.05    19.43
Spring PC score      N          a5)      Si       87       84       69       45       103      92
                     M          435.96   417.04   445.06   428.60   472.84   456.07   475.01   461.92
                     SD         22.73    18.64    22.09    21.38    13.09    17.21    12.43    14.45
Spring SS score      N          a)       51       88       84       69       45       103      92
                     M          484.44   477.63   489.73   483.93   498.75   491.93   497.18   494.36
                     SD         10.41    13.83    6.87     10.70    eyla     8.92     8.45     6.24

Note. NSC = non-struggling readers in control schools; SRC = struggling readers in control schools; NSI = non-struggling students in intervention
schools; SRI = struggling readers in intervention schools; PPVT-III = Peabody Picture Vocabulary Test-III; WA = Word Attack; LW = Letter Word
Identification; PC = Passage Comprehension; SS = Spelling of Sounds.


1. Re-reading for fluency. The teacher asks the student to 
re-read a selection that she/he has read at least once in the recent 
past for the purpose of developing reading fluency. The teacher 
might model fluent reading with some of the text, depending on the 
skill level of the child. This is done even with children who are 
non-readers through scaffolding and modeling. For example, ask- 
ing the child where to start reading and identifying initial sounds 
in words can be a way to help a beginner be successful, even when 
they have extremely limited alphabetic knowledge. 


2. Word Work. This innovative approach provides the teacher
with a variety of assessment-based multi-sensory instructional 
strategies for helping the child manipulate, say, and write words. In 
the early stages, there are four major strategies that are employed 
using a white board and letter sounds (letter combinations) tiles to 
help children make words and to see, hear, and manipulate differ- 
ences between words. These four strategies were adapted to four 
major levels of child skill in reading and writing words. Level 1 of
Word Work was geared to children who had almost no knowledge
of the alphabetic principle and focused on three-sound words with
short vowels. Level 2 of Word Work was geared to slightly more
advanced knowledge of the alphabetic principle and introduced
children to four-sound words. The third level of Word Work
allowed children more advanced phonics work with long vowel
sounds that can be represented by a variety of vowel constellations.
The fourth and final level of Word Work focused on multi-syllabic
words.

Table 2
Demographics of the Teachers

                                           Treatment (n = 43)      Control (n = 32)
Variable                                   n      M       SD       n      M       SD
Race
  Black/African American                   6                       4
  White/European American                  35                      25
  Other                                    1                       3
  Missing                                  1
Gender
  Female                                   41*                     32
Age
  20-29                                    6                       8
  30-39                                    11                      8
  40-49                                    12                      8
  50-59                                    11                      7
  60+                                      5                       1
Certification level
  Elementary education certified           40                      28
  Master’s degree or higher                10                      22
Experience
  Total years teaching                            17.60   10.77           Seis}   Oo)
  Total years teaching current grade              S95     E88             SO      S.95
  Total years teaching at current school          8.00    5.59            7.45    8.00
  Total years teaching in current county          12530   8.93            9.45    8.56

* One teacher not reporting gender, and one male teacher in experimental
schools.

Along with the help of their literacy coaches, teachers made 
decisions about when to progress to more challenging levels of 
word identification and adopt slightly different strategies. The 
graphic organizer for the teacher helped her understand the four 
levels and the key diagnostic criteria to place a child within these 
skill levels. Thus, teachers learned to assess the child’s level of 
word identification and select a particular diagnostic strategy that 
matched the skill level of the child to achieve instructional match 
(Bear, Invernizzi, Templeton, & Johnston, 2003; Beck, 2006; 
Connor, Morrison, Fishman, Schatschneider, & Underwood, 2007; 
Connor et al., 2009; Morris, Tyner, & Perney, 2000). All TRI 
strategies demonstrate the alphabetic principle, help students learn 
phoneme-grapheme (sound-symbol) relationships, develop students’
segmenting and blending abilities (phonemic awareness
tasks), and help students recognize sight words. The four primary 
strategies are (1) Segmenting Words; (2) Change One Sound; (3) 
Read, Write, and Say; and (4) Pocket Phrases. 

Segmenting Words helps children to acquire knowledge about 
the sounds in simple, but progressively more difficult words, by 
allowing the child to use the letter sound tiles to build words by 
saying and moving each tile. For example, a child with limited 
alphabetic knowledge would begin at the lowest level, targeting
words that can be made with the following few letters: a, s, m, t,
and p. Three-letter words with beginning consonants, such as “sat”
or “mop” and not “top,” would be chosen because phoneme segmentation
is easiest to teach with these types of words, since the
teacher can stretch out the sounds more easily. For example, the
teacher might place three letter-sound cards (t, m, a) at the top of 
the Word Work board and say, 


Hannah, I want you to help build a word right here [tapping lines on 
bottom of board]. The first word is “mat,” /mmmmat/ [as she drags 
her finger along the three short lines at the bottom of the board in 
concert with the sounds she is making]. I wiped my feet on the mat. 
What sound do you hear here [pointing to Word Work board] in the 
word /mmmat/? 


With feedback, the child will progress and this allows the teacher 
to gradually progress to more challenging words. 

Change One Sound helps children contrast the sounds between 
words by placing selected letter sound tiles in front of the child and 
helping the child to make a word like “map” from the letter sound 
tiles and then asking him to change that word to “mop” by 
replacing the “a” tile with the “o” tile. As children become more 
proficient in this strategy, teachers may focus on medial, beginning 
or ending sounds and always in the contexts of real words. 
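
The Change One Sound contrast can be made concrete with a small sketch that checks whether two words differ in exactly one position, as "map" and "mop" do. It treats each tile as a single letter, which is a simplification (the TRI tiles can also carry letter combinations), and the function name `changed_tile` is hypothetical.

```python
def changed_tile(old_word, new_word):
    """Return (position, old_letter, new_letter) if exactly one letter differs, else None."""
    if len(old_word) != len(new_word):
        return None
    diffs = [
        (i, old, new)
        for i, (old, new) in enumerate(zip(old_word, new_word))
        if old != new
    ]
    return diffs[0] if len(diffs) == 1 else None

changed_tile("map", "mop")  # (1, 'a', 'o'): the medial tile is swapped
changed_tile("map", "top")  # None: two tiles change, so this is not a one-sound contrast
```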

Read, Write, and Say helps children to read new words and write 
those words on the white board. As children write the words they 
also say the words again. 

Pocket Phrases helps children to remember sight words/phrases 
by writing these words/phrases on cards and asking them to show 
and read to others in their class or at home. 

These four strategies, along with other more advanced strate- 
gies, can be used with TRI instructional levels that gradually 
expose the student to more and more alphabetic complexity, keep- 
ing him/her challenged. After each session, the teacher then goes 
back to her diagnostic map and develops a plan for the child’s next 
session. 

3. Guided Oral Reading (GOR). Strategies are employed in 
a text chosen at the child’s instructional reading level, as guided by 
the Word Work sessions and Diagnostic Map. Teachers pay particular
attention to scaffolding children’s abilities to summarize,
predict, make connections, and draw inferences. Vocabulary words that
may be difficult are defined, and a picture dictionary is available
during this part of each session. During a book-reading session the
teacher might ask the child to define a word, to answer what
might happen next, or to answer a causal question about the
storyline. Having children orally summarize the story at the end 
helps the teacher understand if the child truly understood the book 
as well as whether the child understands the conventions of sto- 
rytelling. Having teachers ask concrete and abstract questions 
about the story also can help them understand whether the child 
understands the nuances of the story and help them understand 
whether the child understands what is demanded by different 
levels of questions. 


Data Collection and Measures 


All children in the study were administered a battery of stan- 
dardized tests in the fall and again in the spring of the school year. 
Teachers filled out questionnaires about their professional background
and classroom. All child assessments were done in the
schools in a quiet room. Trained graduate students or former
teachers conducted the child assessments. The assessors participated
in a 2-day training, which included the administration of the
pated in a 2-day training, which included the administration of the 
complete battery with non-participating students. Assessors were 
not informed which schools were experimental or control. The 
following measures were administered to children in the fall and 
the spring. 

Four subtests of the Woodcock-Johnson III Diagnostic Reading
Battery (WJ-DRB III; Woodcock, Mather, & Schrank, 2004) were
administered to all children. Word Attack measures skill in apply- 
ing phonic and structural analysis skills to the pronunciation of 
unfamiliar printed sounds and words. The initial items require the 
child to produce sounds for single letters. The remaining items 
require the child to read aloud letter combinations that are pho- 
netically consistent, or regular, patterns in English orthography but 
are non-words or low-frequency words. The items become pro- 
gressively more difficult. Word Attack has a median reliability of 
.87 in the 5-19 age range (Woodcock et al., 2004). 

Letter Word Identification measures the child’s word identifi- 
cation skills. The initial items require the child to identify letters 
that appear in large type, and the remaining items require the child 
to pronounce words correctly. The items become increasingly 
difficult as the selected words appear less and less frequently in 
written English. Letter Word Identification has a median reliability 
of .91 in the 5-19 age range (Woodcock et al., 2004).

Passage Comprehension initial items measure symbolic learn- 
ing and require the child to match a rebus with an actual picture of 
an item. The more advanced items employ a modified cloze 
procedure that requires the child to read a short passage and 
provide a missing key word which makes sense within the context 
of the passage. The items become increasingly difficult by remov- 
ing pictorial support and by increasing passage length and diffi- 
culty as well as vocabulary complexity. Passage Comprehension 
has a median reliability of .83 (Woodcock et al., 2004). 

Spelling of Sounds measures the child’s spelling ability, in 
particular, phonological and orthographical coding skills. Initial 
items require the child to write single letters for sounds. Remaining 
items require the child to spell letter combinations that are regular 
patterns in English. Items increase in difficulty by requiring more 
complex spelling patterns. Spelling of Sounds has a median reli- 
ability of .74 (Woodcock et al., 2004). 

The Peabody Picture Vocabulary Test-III (PPVT-III; Dunn &
Dunn, 1997) is an individually administered, norm-referenced test 
of receptive vocabulary knowledge. Children are asked to select a 
picture that best represents the meaning of the stimulus word 
presented orally by the examiner. Alpha coefficients for the 
PPVT-III for elementary age students range from .92 to .95 (Dunn 
& Dunn, 1997). 


Fidelity of Implementation 


To assess fidelity of implementation of the TRI, classroom 
teachers reported exposure of each target child to the TRI as well 
as the teachers’ adherence to the elements of the TRI to an on-site 
facilitator during the biweekly team meetings with the literacy 
coach. The teachers then entered the data online. The fidelity data
are summarized in Table 3 across kindergarten and first grade 
since there were no grade level differences on any of the fidelity 
measures. 
Exposure was measured by the number of weeks that each child
received the TRI over the course of the year and the total number
of sessions per week. Each of the five target children received
one-on-one TRI intervention for an average of 6 weeks, with 2.4
sessions per week, for a total of about 14 sessions per child.

Adherence to the TRI was measured by the number of reported
weeks that each of the three parts of the TRI was implemented
with each child: Re-Reading for Fluency, Word Work, and Guided
Oral Reading. Teachers reported implementing Re-Reading for
Fluency in 80% of weeks, Word Work in 96% of weeks, and Guided
Oral Reading in 92% of weeks.


Results 


Missing Data Methods 


More than 85% of the sample participated in fall and spring 
assessments and provided demographic background information. 
To avoid imprecise regression estimation due to missing data, we 
created and analyzed multiple imputed data sets in SAS Version 
9.1. Multiple imputation procedures use an iterative (chained equa- 
tions) method to estimate the multivariate relations among study 
variables for cases with available data. These observed relations 
among study variables are then used to estimate plausible values 
for missing data. Creating multiple data sets with plausible values 
for missing data and aggregating solutions from analyses using 
multiple data sets provides the best approximation of relations 
among variables given no missing data (Graham, Olchowski, & 
Gilreath, 2007; Schafer & Graham, 2002). Consequently, the anal- 
ysis of variance (ANOVA) and analysis of covariance (ANCOVA) 
models presented below were run on each of 20 imputed data sets, 
and model parameters were aggregated across the data sets using 
the PROC MIANALYZE function in SAS. The imputation model 
included the following: fall and spring assessment scores for all 
outcomes, child grade, child race (White, Black), child gender, 
mother’s education, and dummy variables indicating school iden- 
tification and randomized treatment status. 
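
The pooling step can be illustrated with a short sketch. The article states only that model parameters were aggregated with PROC MIANALYZE; the Python below is not the authors' code but a minimal illustration of Rubin's rules, the standard combining rules for multiply imputed analyses. The function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules.

    estimates  -- the coefficient estimated in each imputed data set
    std_errors -- the matching standard errors
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled = statistics.fmean(estimates)
    # Within-imputation variance: the average squared standard error.
    within = statistics.fmean([se ** 2 for se in std_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    # Total variance inflates the between part by (1 + 1/m).
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical coefficients from three imputed data sets (the study used 20).
estimate, se = pool_rubin([5.0, 6.0, 7.0], [1.0, 1.0, 1.0])
print(round(estimate, 2), round(se, 2))
```

The pooled standard error exceeds any single-data-set standard error whenever the estimates disagree across imputations, which is how the uncertainty due to missing data enters the final tests.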


Preliminary Analysis 


Before testing intervention effects on student literacy outcomes, 
we conducted preliminary analysis to verify the validity of teach- 
ers’ identification of struggling readers. To validate teachers’ 
identification, we compared fall scores on all outcomes of interest 
for struggling readers and non-struggling readers in the sample. 


These models were estimated in SAS 9.1 as three-level ANOVAs 


Table 3 
Fidelity of Implementation 

Variable                                            N         M         SD
Total number of weeks of TRI                        [illeg.]  [illeg.]  [illeg.]
Number of sessions per week of the TRI              [illeg.]  [illeg.]  [illeg.]
Proportion of weeks Re-Reading for Fluency done     167       0.83      0.25
Proportion of weeks Guided Oral Reading done        [illeg.]  [illeg.]  [illeg.]
Proportion of weeks Word Work done                  [illeg.]  [illeg.]  [illeg.]


Note. Teachers with intermittent reporting of fidelity were dropped from 
the fidelity analysis. TRI = Targeted Reading Intervention. 
accounting for the nesting of students within classrooms and
classrooms within schools. Three-level models predicted fall scores as
a function of a four-category intervention group variable at Level 
2 (SRC, NSC, SRI, NSI). Follow-up contrasts of struggling versus 
non-struggling fall scores were conducted to test for mean differ- 
ences in performance before intervention. For all outcomes of 
interest, struggling readers scored significantly lower than non- 
struggling readers did before this intervention. We present the 
results of the tests of mean differences in Table 4. 


Tests of the Effects of the Intervention 


Analytic strategy. Multi-level (hierarchical) models were 
used to examine our questions about the effectiveness of the TRI 
for struggling readers. Separate models were conducted for each of 
five outcomes: Word Attack, Letter Word Identification, Passage 
Comprehension, Spelling of Sounds, and PPVT. All models were 
estimated in SAS Version 9.1 as a three-level ANCOVA account- 
ing for the nesting of students within classrooms and classrooms 
within schools. Effect sizes for significant treatment effects were 
calculated by dividing the contrast coefficient (mean difference) 
by the square root of total variation in the model. 
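
As a rough check on this calculation, the sketch below recomputes two effect sizes from values printed in Table 5. It assumes that "total variation" means the sum of the Level 3, Level 2, and Level 1 variance components, which the article does not state explicitly; the function name `effect_size` is hypothetical.

```python
import math


def effect_size(contrast, variance_components):
    """Divide a mean difference by the square root of total model variance."""
    total_variance = sum(variance_components)
    return contrast / math.sqrt(total_variance)


# Word Attack: SRI vs. SRC contrast, then Level 3, 2, and 1 variance estimates.
d_wa = effect_size(5.66, [10.44, 16.73, 217.69])
# Letter Word Identification, same ordering.
d_lw = effect_size(8.39, [19.54, 21.42, 198.31])
print(round(d_wa, 2), round(d_lw, 2))  # 0.36 0.54
```

Under that assumption, both values agree with the effect sizes reported for these two outcomes (0.36 and 0.54).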

The three-level ANCOVA predicted spring scores as a function 
of fall pre-test scores as a fixed effect at Level 1, a four-category 
intervention group fixed effect at Level 2, and a set of Level 1
fixed effects used as covariates across all models: gender is male, 
mother’s years of education, grade (K = 0, first = 1), and race is 
White. This model estimated random effects for classroom and 
school intercepts. All covariates including the pre-test were cen- 
tered for analysis, so that the intercept in the models reflected 
average spring scores for the treatment reference group, that is, 
struggling readers in intervention schools. 

Intervention effects to answer Question 1 were established by
testing the significance of the conditional mean difference between 
spring scores for struggling readers in intervention schools and 
struggling readers in control schools. 

In order to answer Question 2 about whether our experimental 
intervention children were progressing at the same rate as their 
non-struggling peers, we used the following rationale. Considering 
that the performance of students in control schools represented 
expected performance for students in experimental schools if in- 
tervention were not administered, we proposed that evidence for 
catch-up would be clearly established given four effects: (1) sig- 
nificant and positive intervention effects between the SRI and the 


Table 4 
Pretest Differences Between Struggling and Non-Struggling 
Readers 


Variable         NS versus SR    SE      p
Fall WA          16.79           1.56    <.0001
Fall LW          [illeg.]        1.67    <.0001
Fall PC          13.37           1.49    <.0001
Fall SS          [illeg.]        1.06    <.0001
Fall PPVT-III    7.43            0.98    <.0001


Note. NS = non-struggling readers; SR = struggling readers; WA = 
Word Attack; LW = Letter Word Identification; PC = Passage Compre- 
hension; SS = Spelling of Sounds; PPVT-III = Peabody Picture Vocab- 
ulary Test—III. 
SRC; (2) non-significant differences between conditional spring
scores for struggling readers in intervention schools (SRI) and
non-struggling readers in intervention schools (NSI; i.e., controlling
for fall scores, students in intervention schools are changing at
a similar rate regardless of struggling status); (3) significant
differences between conditional spring scores for struggling readers
in control schools (SRC) and non-struggling readers in control 
schools (NSC; i.e., non-struggling readers are changing at a greater 
rate than struggling readers are in control schools); and (4) non- 
significant differences between conditional spring scores for non- 
struggling students in intervention schools (NSI) and non- 
struggling readers in control schools (NSC; i.e., no evidence that 
non-struggling students in intervention schools are lagging com- 
pared to non-struggling students in control schools). These con- 
trasts are presented in Table 5. 
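
The four-part rationale can be written as a simple decision rule. The sketch below is illustrative only: the function name `shows_catch_up` is hypothetical, a significance threshold of .05 is assumed because the article does not state one, and p values reported as "< .0001" are entered as 0.0001. The example values are the Word Attack and Passage Comprehension contrasts reported in the Results and Table 5.

```python
def shows_catch_up(contrasts, alpha=0.05):
    """Apply the four catch-up criteria to a dict of (estimate, p) pairs."""
    sri_src_b, sri_src_p = contrasts["SRI vs. SRC"]
    _, sri_nsi_p = contrasts["SRI vs. NSI"]
    _, src_nsc_p = contrasts["SRC vs. NSC"]
    _, nsi_nsc_p = contrasts["NSI vs. NSC"]
    return (
        sri_src_b > 0 and sri_src_p < alpha  # (1) positive, significant effect
        and sri_nsi_p >= alpha               # (2) SRI keeps pace with NSI
        and src_nsc_p < alpha                # (3) SRC falls behind NSC
        and nsi_nsc_p >= alpha               # (4) NSI does not lag behind NSC
    )


word_attack = {
    "SRI vs. SRC": (5.65, 0.04),
    "SRI vs. NSI": (-0.55, 0.75),
    "SRC vs. NSC": (-5.13, 0.02),
    "NSI vs. NSC": (1.08, 0.64),
}
passage_comprehension = {
    "SRI vs. SRC": (7.65, 0.008),
    "SRI vs. NSI": (-8.29, 0.0001),
    "SRC vs. NSC": (-11.30, 0.0001),
    "NSI vs. NSC": (4.65, 0.09),
}
print(shows_catch_up(word_attack))            # True
print(shows_catch_up(passage_comprehension))  # False
```

Passage Comprehension fails the second criterion: struggling readers in intervention schools still differed significantly from their non-struggling classmates.
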
The reduced form equation for the model is as follows: 


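\[
\begin{aligned}
Y_{ijk} ={} & \gamma_{000} + \gamma_{100}(\text{pre-test})_{ijk} + \gamma_{020}(\text{treatment})_{jk} + \gamma_{300}(\text{male})_{ijk} \\
& + \gamma_{400}(\text{mother's education})_{ijk} + \gamma_{500}(\text{grade})_{ijk} + \gamma_{600}(\text{White})_{ijk} \\
& + u_{00k} + r_{0jk} + e_{ijk}
\end{aligned}
\]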


In this notation, fixed effects are represented by gammas (γ),
and random effects are reflected in two error terms: a term for
Level 3 variation between schools ($u_{00k}$) and a term for Level 2
variation between classrooms in schools ($r_{0jk}$). In exploratory
models, we tested a Grade × Treatment interaction. Because that
interaction was not significant for any of the outcomes, we ex- 
cluded it from the final models. Results from the multi-level 
ANCOVA appear in Table 5. Formal tests of treatment main 
effects are represented twice in the table: first as the coefficient for 
the struggling control group listed in the fixed effects (SRC) and 
second as the formal contrast of struggling intervention (SRI) and 
struggling control (SRC) conditional means. 
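
The link between the group coefficients and the planned contrasts can be sketched as follows. With struggling readers in intervention schools (SRI) as the reference group, each contrast is assumed here to be a simple difference between group coefficients. The values are the Passage Comprehension coefficients for NSC and NSI printed in Table 5; the SRC coefficient is assumed to be the negative of the 7.65-point treatment effect reported in the text. The helper name `contrast` is hypothetical.

```python
# Group coefficients relative to the reference group (SRI = 0).
# Passage Comprehension values; SRC is assumed from the reported 7.65 effect.
coef = {"SRI": 0.0, "NSC": 3.65, "SRC": -7.65, "NSI": 8.29}


def contrast(group_a, group_b, coefficients=coef):
    """Conditional mean difference, group_a minus group_b, to 2 decimals."""
    return round(coefficients[group_a] - coefficients[group_b], 2)


for a, b in [("SRI", "SRC"), ("SRI", "NSI"), ("SRC", "NSC"), ("NSI", "NSC")]:
    print(f"{a} vs. {b}: {contrast(a, b):.2f}")
```

Under these assumptions the sketch gives 7.65, -8.29, -11.30, and 4.64. The first three match the reported Passage Comprehension contrasts, and the last differs from the tabled 4.65 only by rounding of the published coefficients.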

The table contains fixed effects, variance components, and 
group contrasts obtained through estimate statements for each of 
five outcomes. Within Table 5, we provide variance components 
for each model. Significance tests of the variance components 
indicated that the variation between schools was not significantly 
different from zero for any of the five outcomes. The variation 
between classrooms within schools was significantly different 
from zero for Letter Word Identification (LW), Passage Compre- 
hension (PC), and PPVT-III only. 

Word Attack. Controlling for differences in pre-test scores, 
TRI had a positive effect on struggling readers’ Word Attack skills. 
Spring scores for struggling readers in intervention schools were 
5.65 higher than scores for struggling readers in control schools 
(b = 5.65, p = .04). There was some evidence that TRI promoted 
catch-up for Word Attack skills. Non-struggling readers in
intervention schools did not outperform struggling readers in
intervention schools, controlling for fall performance (b = -0.55, p = .75).
However, non-struggling readers in control schools did outperform
struggling readers in those schools, controlling for fall scores (b =
-5.13, p = .02). There was no evidence that non-struggling readers
in intervention schools underperformed relative to non-struggling 
readers in control schools (b = 1.08, p = .64). Above and beyond 
intervention effects, male students had lower spring Word Attack 
scores, and more maternal education was associated with higher 
spring scores. 
Table 5
HLM Intervention Effects and Planned Comparisons

                      WA                                  LW                                  PC
Fixed effects         B        SE       p        d        B        SE       p        d        B        SE       p        d
Pretest               0.45     0.03     <.0001            0.64     0.03     <.0001            0.32     0.03     <.0001
White                 0.55     1.45     [illeg.]          0.66     1.42     .64               [illeg.] 1.47     .02
Male                  -4.32    [illeg.] .00               -3.64    1.25     .00               -2.60    1.32     .05
Maternal education    0.73     0.31     .02               0.64     0.31     .04               1.45     0.32     <.0001
Grade                 0.81     [illeg.] .70               11.43    2.24     <.0001            22.92    2.06     <.0001
NSC                   [illeg.] 2.68     .84               [illeg.] [illeg.] .99               3.65     2.81     .20
SRC                   -5.66    [illeg.] .04               -8.39    3.20     .01               [illeg.] 2.88     .01
NSI                   0.55     1.68     .74               2.38     1.64     .15               8.29     1.63     <.0001
Variance components
Level 3 variation     10.44    8.81     .24               19.54    12.07    .11               13.31    10.59    [illeg.]
Level 2 variation     16.73    9.01     .06               21.42    [illeg.] .03               18.21    8.55     .03
Level 1 variation     217.69   13.89    <.0001            198.31   [illeg.] <.0001            216.55   13.76    <.0001
Contrasts
SRI vs. SRC           5.66     [illeg.] .04      0.36     8.39     3.20     .01      0.54     7.65     2.88     .01      0.48
SRI vs. NSI           -0.55    1.68     .75      0.04     -2.38    1.64     .15      0.15     -8.29    1.63     <.0001   0.53
SRC vs. NSC           -5.13    [illeg.] .02      0.33     -8.36    2.09     <.0001   0.54     -11.30   2.11     <.0001   0.72
NSI vs. NSC           1.08     2.62     .64      0.07     2.41     3.05     .43      0.16     4.65     2.77     .09      0.29

                      SS                                  PPVT-III
Fixed effects         B        SE       p        d        B        SE       p        d
Pretest               0.31     0.02     <.0001            0.65     0.04     <.0001
White                 0.78     0.71     [illeg.]          2.92     1.00     .00
Male                  [illeg.] 0.62     .05               0.78     0.84     .35
Maternal education    0.50     0.15     .00               0.41     0.22     .07
Grade                 2.06     1.03     .05               0.37     1.36     .79
NSC                   -0.98    1.30     [illeg.]          4.01     2.03     .05
SRC                   [illeg.] [illeg.] .02               1.34     [illeg.] [illeg.]
NSI                   [illeg.] 0.83     .89               2.67     1.01     .01
Variance components
Level 3 variation     2.63     [illeg.] [illeg.]          [illeg.] 5.21     .29
Level 2 variation     2.78     1.79     [illeg.]          22.85    6.49     .00
Level 1 variation     51.01    [illeg.] <.0001            86.48    [illeg.] <.0001
Contrasts
SRI vs. SRC           3.23     1.33     .02      0.63     -1.59    [illeg.] .46      0.15
SRI vs. NSI           0.12     0.83     .89      0.02     [illeg.] 0.98     .02      0.22
SRC vs. NSC           [illeg.] 1.05     .03      0.44     -2.60    1.30     .05      0.24
NSI vs. NSC           0.87     1.26     .49      0.17     -1.80    2.08     .39      0.17

Note. Bolded ds are significant effect sizes. HLM = hierarchical linear modeling; WA = Word Attack; LW = Letter Word Identification; PC = Passage
Comprehension; NSC = non-struggling readers in control schools; SRC = struggling readers in control schools; NSI = non-struggling students in
intervention schools; SRI = struggling readers in intervention schools; SS = Spelling of Sounds; PPVT-III = Peabody Picture Vocabulary Test-III.


Letter Word Identification. Controlling for differences in
pre-test scores, TRI had a positive effect on struggling readers’
Letter Word Identification skills. Spring scores for struggling
readers in intervention schools were 8.39 points higher than
spring scores for struggling readers in control schools (b =
8.39, p < .01). Again, there was evidence to suggest that
struggling readers who participated in TRI were beginning to
catch up to their non-struggling peers. Struggling readers in the
intervention schools made gains in Letter Word Identification
skills that did not differ significantly from gains made by their
non-struggling classroom peers (b = -2.38, p = .15). In
contrast, spring performance for struggling readers in control
schools was lower than their non-struggling classmates’
performance (b = -8.36, p < .0001). There was no evidence that
non-struggling students in experimental schools underperformed
relative to non-struggling students in control schools
(b = 2.41, p = .43). Above and beyond intervention effects,
male students had lower spring LW scores, first graders had
higher spring LW scores, and higher maternal education was
associated with higher spring scores.

Passage Comprehension. Controlling for differences in
pre-test scores, TRI had a positive effect on struggling readers’ PC
skills. Spring scores for struggling readers in intervention schools
were approximately 7.65 points higher than spring scores for
struggling readers in control schools (p = .008). However, there
was not strong evidence that TRI promoted catch-up. Struggling
readers in intervention schools gained less than their non-
struggling classmates did (b = -8.29, p < .0001), and struggling
readers in control schools gained less than their non-struggling
classmates did (b = -11.30, p < .0001). Above and beyond
intervention effects, students with higher maternal education,
White students, and first graders had higher spring PC scores.

Spelling of Sounds. Spring scores for struggling readers in 
intervention schools were 3.23 points higher than spring scores for 
struggling readers in control schools (p = .02). Struggling readers
in intervention schools gained at the same rate as their non-
struggling classmates, as evidenced by a non-significant difference
in spring performance (b = 0.12, p = .88). In contrast, spring
scores for struggling readers in control schools were lower than
those of their non-struggling classmates (b = -2.25, p = .03).
There was no evidence that non-struggling readers in intervention 
schools underperformed relative to non-struggling readers in con- 
trol schools (b = 0.87, p = .49). Above and beyond intervention 
effects, higher maternal education was associated with higher 
spring Spelling of Sounds (SS) scores, and first graders had higher 
spring scores. 

PPVT-III. There was no evidence that TRI had a positive 
effect on PPVT-III skills (b = -1.59, p = .46). There was 
evidence that spring performance for struggling readers in control 
schools differed from that of non-struggling readers in those schools (b =
-2.38, p = .02). There was no evidence that non-struggling readers
in intervention schools underperformed relative to non-struggling
readers in control schools (b = -1.80, p = .39). Above and beyond
intervention effects, White students had higher spring PPVT-III
scores.


Discussion 


The results from this study, using webcam technology to coach 
classroom teachers to individualize reading instruction for strug- 
gling readers, suggested that the TRI can significantly help strug- 
gling readers progress more quickly across a broad range of 
reading skills, including basic word reading, spelling, and passage 
comprehension skills over 1 year in comparison to children who 
did not receive this intervention. Furthermore, there was evidence 
that across some of these achievement measures in word reading 
and spelling of sounds the children in the TRI experimental group 
were able to progress at the same rate as their non-struggling peers. 
On the other hand, there was no evidence that the TRI could 
eliminate the gap between struggling and non-struggling readers in 
reading or improve receptive vocabulary over a 1-year period.

This study was important in a number of ways. First, webcam
coaching appeared to be an effective and efficient way to deliver
professional development to classroom teachers in remote rural
schools. This study is one of only
a few that have used technology to deliver professional development
to classroom teachers (Mashburn et al., 2010; Powell et al., 2010) 
and the only one to use live webcam literacy coaching for class- 
room teachers that provided live and immediate feedback on 
teaching practices. In previous recent studies, teachers were given 
feedback on their teaching practices through delayed feedback 
from literacy consultants viewing videotapes of teachers’ instruc- 
tional practices. Moreover, as we have mentioned before, the live 
webcam sessions where the coach could see and hear the teacher 
working with a struggling reader and give real-time feedback were 
also likely critical in helping the teacher implement the interven- 
tion more quickly. In addition, the use of webcam technology 
allowed the teacher to have control of when the sessions took place 
and created efficiencies for the time allotted for the coaching 
sessions. For the current study, a half-time doctoral student could 
coach up to 12 classroom teachers at a time, given the flexibility 
afforded by the use of this technology. Although no
cost/benefit analysis was attempted in this study,
the webcam technology was affordable. Laptop computers
cost about $800 apiece, and iChat and Skype were
free to the users. The webcam technology could be implemented in 
almost any school since even remote rural schools have adequate 
Internet access. Thus, webcam technology as a tool for profes- 
sional development builds on the previous work of others who 
have used videotape technology to provide professional develop- 
ment (Mashburn et al., 2010; Pianta et al., 2008; Powell et al., 
2010). 

Second, the TRI produced larger effect sizes than other studies, 
especially given that the classroom teacher was the one imple- 
menting the intervention. Across a broad range of reading assess- 
ments, the TRI produced effect sizes from .36 to .63 for kinder- 
garten and first grade children over a 1-year period compared to 
children who did not receive the intervention. The current study 
had strong effects on both word level reading and reading com- 
prehension, with a strong effect size of .48 on reading comprehen- 
sion, even after accounting for school and classroom variance as 
well as maternal education, gender, and race. In addition, on three
of the four reading measures, the TRI experimental children
progressed at the same rate of growth in reading as the non-struggling
readers in the same classrooms. Most other successful reading 
programs have been able to improve word reading skills over 1 
year but many found no or small effects on reading comprehension 
(Foorman, Francis, Fletcher, Schatschneider, & Mehta, 1998; 
Torgesen et al., 1999). For instance, in a review of 42 one-on-one 
early reading intervention programs for at risk students, it was 
found that the effect sizes for word reading and passage reading 
were the greatest, with effect sizes from .41 to .54, whereas the 
effect sizes for reading comprehension were a modest .28 (El- 
baum, Vaughn, Hughes, & Moody, 2000). In the current study, the 
effect sizes for word reading skills were .36 and .54, and the effect 
size for spelling was .63, certainly comparable to other studies. 
However, the effect size for reading comprehension in this study 
was almost double the average reported by Elbaum et al. (2000) 
across 42 intervention studies. This comprehension effect was 
probably due to the greater emphasis placed on reading compre- 
hension compared with many other early reading interventions that 
often focus on improving children’s decoding skills. The TRI 
emphasized not only word reading skills but comprehension skills 
during both Word Work and Guided Oral Reading, which made 
sure children could define the words they were reading, summarize 
stories they read, and answer complex questions about the texts, 
including causal and prediction questions. 

In addition, previous studies that have used technology to help 
the classroom teacher, using video feedback on practices, have 
found positive effects, but the effect sizes in these studies were 
considerably smaller than reported in the current study. For in- 
stance, the most recent studies, using video feedback on language 
and literacy practices to help teachers in preschool (Mashburn et 
al., 2010; Powell et al., 2010), reported effect sizes for the children 
in the study of .1-.29. Although these were significant and impor- 
tant, the effect sizes in the current study for both word level 
reading and reading comprehension were .36-.63. These findings
may suggest that future work consider webcam technology for 
feedback to teachers as the most effective way to get gains in 
reading for struggling readers. 

A third, particularly important finding of this study was that
the classroom teachers successfully implemented an
intervention for struggling readers with the help of literacy
coaches. Most successful reading interventions for struggling read- 
ers have either used a specialized teacher to deliver the interven- 
tion outside the regular classroom or they have found effects only 
on word level reading (Foorman et al., 1998; Hurry & Sylva, 2007; 
Torgesen et al., 2001, 1999). This method of coaching classroom 
teachers appeared to be just as effective in helping struggling 
readers as employing one-on-one tutors. Elbaum et al. (2000) 
reported effect sizes for 42 tutoring interventions, with an average 
effect size of .41, which is comparable to the results in this study. 
Previous interventions using classroom teachers have not been 
particularly effective in helping struggling readers in early elemen- 
tary school as suggested by reviews of the literature (Risko et al., 
2008). Thus, this study is unusual in demonstrating not only
that the classroom teacher can implement effective instruction
for struggling readers but also that the students gain on a broad
range of reading measures, including both word level reading skills
and reading comprehension. Although we do not have direct
evidence from this study, we believe, as other studies have
argued, that literacy coaching (Carlisle & Berebitsky, 2011;
International Reading Association, 2004) allowed teachers to get
real-time feedback on individualizing instruction for particular
struggling readers (Scanlon, Gelzheiser, Vellutino, Schatschneider, &
Sweeney, 2008; Speece, Case, & Molloy, 2003), which in turn
enabled the children to gain in early reading.

Fourth, the teachers in this study were able to implement the 
TRI literacy strategies with relatively little training and with rel- 
atively modest instructional time per student. On average, teachers 
worked individually with a child two to three times per week for 
6 weeks, with an average of 14 sessions for each child over the 
course of the year. In programs like “Reading Recovery,” which 
used a specialized teacher, both more sessions and longer sessions 
were needed to achieve rapid progress (Elbaum et al., 2000; 
Schwartz, 2005). Even though our study used fewer resources and 
less time with individual children, the effect sizes for this study 
were comparable to those reported for a host of studies reviewed 
by Elbaum et al. (2000), all of which used a specialized teacher to 
deliver one-on-one intervention. Given the fewer resources avail- 
able in low-wealth rural schools, the possibility of using the 
classroom teacher as the vehicle to help prevent reading failure in 
struggling readers and doing so with not much time taken away 
from the instructional day may have many benefits that hopefully 
can be replicated in future studies. 


Limitations 


There are a number of limitations in this study. First, there was 
a small school that was dropped from the study because of prob- 
lems with technology. Even though there were only two class- 
rooms in this school, this lack of participation of the school 
compromises our ability to make strong causal inferences. Second, 
there was limited information on fidelity. We were able to docu- 
ment exposure and adherence of the implementation (O’Donnell, 
2008), but we were not able to measure the actual quality of the 
implementation. In future studies, it will be important to objec- 
tively observe the biweekly coaching sessions to more carefully 
document the quality of implementation beyond amount and ad- 
herence. This can now be done with the new capabilities to 
digitally record the iChat or Skype sessions. We believe that the 
webcam coaching was the most important aspect of the interven-
tion that led to change in student reading because of previous
research that has demonstrated the importance of coaching (Al 
Otaiba et al., 2011; Carlisle & Berebitsky, 2011; Connor et al., 
2011, 2009), but since we did not have a study that separated the 
effect of the summer institute from the coaching, we can only 
speculate on the reasons for the TRI success. We do know from 
previous work that workshops and institutes do not seem to be 
enough to produce real change in teachers that leads to improved 
reading for children (Garet et al., 2008; McGill-Franzen et al., 
1999). Last, the TRI was not implemented long enough or in- 
tensely enough to allow the struggling readers to catch up with 
their non-struggling peers. Although the struggling readers
gained at the same rate as their non-struggling peers in three
areas of early literacy when they received the TRI, the gap
between the two groups did not close. Future efforts may need
programs like the TRI to be implemented more often for each child
over 1 year, or struggling readers may need to participate in the
TRI over multiple years, so that most struggling readers can catch
up to their non-struggling peers.


Summary 


Even with these limitations, this study is one of the first to 
suggest that the regular classroom teacher can learn effective 
instructional strategies using webcam literacy coaching that can 
lead to significant early reading gains in struggling readers and 
hopefully prevent reading failure in subsequent grades. It appears 
that efficient webcam technology may have contributed to the 
effectiveness of TRI by providing the regular classroom teacher 
with easy access to live feedback in the regular classroom on an 
ongoing basis over the school year. This webcam technology may 
be particularly effective in delivering professional development to 
classroom teachers in rural schools because these teachers do not 
have easy access to professional development opportunities due to 
geographic isolation. Webcam coaching may be effective not
only for rural schools but also for a wide variety of schools, and it
could be used to deliver professional development in other content
areas for improving the instruction of classroom teachers.
Using Electronic Portfolios to Foster Literacy and Self-Regulated Learning
Skills in Elementary Students 


Philip C. Abrami, Vivek Venkatesh, Elizabeth J. Meyer, and C. Anne Wade 


Centre for the Study of Learning and Performance, Concordia University 


The research presented here is a continuation of a line of inquiry that explores the impacts of an electronic 
portfolio software called ePEARL, which is a knowledge tool designed to support the key phases of 
self-regulated learning (SRL)—forethought, performance, and self-reflection—and promote student 
learning. Participants in this study were 21 teachers from elementary schools (Grades 4-6) and their
students (N = 319) from 9 urban and rural English school boards in Quebec and Alberta, Canada, who
participated during the 2008-2009 school year. Students with low enthusiasm for the use of ePEARL 
were excluded from the main sample as they exhibited different patterns in learning gains and self- 
regulatory skills as compared with those with high and medium enthusiasm. Multivariate analyses of 
covariance showed that students motivated to use the software made significantly greater gains compared 
with controls in 3 of 4 writing and reading skills (p < .01) as assessed by the constructed response subtest 
of the Canadian Achievement Test (fourth edition). Multivariate analyses of covariance of student survey 
data revealed that, over time, students who used the software reported higher levels of SRL processes 
than those in the control group (p < .01). Implications of the findings for school leaders and teacher 
educators regarding the use of electronic portfolios are discussed. 


Keywords: self-regulation, electronic portfolios, quasi-experimental research, elementary education,
classroom research


A recent international report on the performance of secondary- 
school students in Western, industrialized countries (Knighton, 
Brochu, & Gluszynski, 2010; Organization for Economic Co-Operation 
and Development, 2010) found that a significant number of students 
in every country lacked fundamental literacy, 
numeracy, and scientific reasoning skills. In addition, some stu- 
dents may lack the sophisticated strategies for learning how to 
learn, strategies that may be increasingly important in the knowl- 
edge age. Furthermore, these gaps in essential competencies and 
skills have substantial personal, social, and economic conse- 
quences. For example, Statistics Canada (Coulombe, Tremblay, & 
Marchand, 2004) estimated that a 1% increase in the rate of adult 
literacy in the Canadian population of about 30 million inhabitants 
would be worth $18.4 billion annually to the Canadian 
economy. 

Contemporary trends in education research indicate that improvements 
in educational success will occur when students become 
more active, engaged participants in their learning, enhancing 
the extent to which learning is personally meaningful (e.g., 
Tobias & Duffy, 2009). These trends also recognize that lifelong 
learning skills, or the ability to develop and use learning strategies 
and skills, are growing increasingly important in the knowledge 
age where content expertise is evolving rapidly. Put colloquially, 
the importance of knowing how is supplanting the importance of 
knowing what. 


Self-Regulated Learning 


Against this backdrop, there is increasing interest in theories and 
research on student-centered learning and their application to 
classroom practice. Self-regulated learners are individuals who are 
metacognitively, motivationally, and behaviorally active partici- 
pants in their own learning (Zimmerman, 2000; Zimmerman & 
Schunk, 2011). Zimmerman (2000) defines self-regulation as 
“self-generated thoughts, feelings, and actions that are planned and 
cyclically adapted to the attainment of personal goals” (p. 14). This 
implies not only behavioral skill management and subject knowl- 
edge but also metacognitive awareness, social influences, and 
motivational beliefs about personal agency. Zimmerman structures 
the self-regulation process in three phases: forethought, perfor- 
mance, and self-reflection. The forethought phase includes task 
analysis in the form of goal setting and strategic planning, and 
self-motivation beliefs in the form of self-efficacy, outcome 
expectations, intrinsic interest, and goal orientation. The per- 
formance phase is divided into self-control, which includes 
self-instruction, imagery, attention-focusing and task strategies, 
and self-observation, which includes self-recording and self- 
experimentation. The third phase is self-reflection. It includes 
self-judgment, composed of self-evaluation and causal attribution, 
as well as self-reaction, which involves self-satisfaction 
and adaptive/defensive responses. The process is described as 
cyclical because successful self-regulation depends on the con- 
stant monitoring and correction of performance based on feed- 
back about recent efforts. 

Zimmerman’s (2000) self-regulation model is a social-cognitive 
model that puts great emphasis on social, environmental, and 
personal influences in efforts to self-regulate effectively. In this 
model, using valuable social resources (such as peer or teacher 
feedback, modeling and emulation of expert behavior) or environ- 
mental support (such as self-rewarding achievement with a relax- 
ing activity) will result in successful self-regulation. 

There are numerous studies of the effectiveness of self-regulated 
learning (SRL) on academic achievement (e.g., Chung, 2000; Paris 
& Paris, 2001; Winne, 1995; Zimmerman, 1990; Zimmerman & 
Bandura, 1994; Zimmerman & Martinez-Pons, 1988) as well as on 
learning motivation (Pintrich, 1999). Furthermore, SRL is consid- 
ered by some as a key competence for lifelong learning (European 
Union Council, 2002). Considering these three areas—academic 
performance, motivation to learn, and learning strategies—where 
students can benefit from SRL, the potential value of SRL training 
programs becomes clear. Providing students with knowledge and 
skills about how to self-regulate their learning may help them to 
self-initiate motivational, behavioral, and metacognitive activities 
in order to control their learning (Zimmerman, 1998). Recent 
empirical work by Venkatesh and Shaikh (2008, 2011) and Shaikh, 
Zuberi, and Venkatesh (2012) has also demonstrated links between 
academic performance and specific self-regulatory processes, namely, 
task understanding and monitoring proficiencies. Our work takes into 
account these relationships by including both academic achievement 
and self-regulation measures in the analyses. 

In two related meta-analyses, Dignath and Buettner (2008) and 
Dignath, Buettner, and Langfeldt (2008) investigated the impact of 
various SRL training characteristics on academic performance, 
strategy use, and the motivation of students. The meta-analyses 
included 49 studies conducted with primary-school students and 
35 studies conducted with secondary-school students. Altogether, 
357 effect sizes were analyzed. 

For achievement outcomes, the average effect size for SRL 
training was +0.61 for primary schools and +0.51 for secondary 
schools. For cognitive and meta-cognitive strategy use, the average 
effect size was +0.72 for primary schools and +0.88 for 
secondary schools. For motivation outcomes, the average effect 
size was +0.75 for primary schools and +0.17 for secondary 
schools. For both school levels, effect sizes were higher when 
researchers conducted the training instead of regular teachers. 
Moreover, interventions attained higher effects when mathematics 
was the subject matter rather than in reading/writing or other 
subjects. The reviewers concluded that SRL can be fostered effec- 
tively at both primary- and secondary-school levels. The current 
research explores whether a technology-based tool for fostering 
SRL would also have positive impacts. 


Knowledge Tools and Electronic Portfolios 


The potential for technology to radically transform and improve 
education is widely recognized by policy makers, scholars, and 
practitioners (Campuzano, Dynarski, Agodini, Rall, & Pendleton, 
2009; Canadian Council on Learning, 2008; CEO Forum on Education 
and Technology, 2001; Dynarski et al., 2007; Ungerleider 
& Burns, 2003; Zimmerman & Tsikalas, 2005); however, there 
have been mixed results when new technologies meet the realities 
of the diverse and changing classroom contexts of schools (Abrami 
et al., 2006; Abrami, Savage, Wade, & Hipps, 2008; Avramidou & 
Zembal-Saul, 2003; Azevedo, 2005; Barrett, 2007; Bernard, 
Bethel, Abrami, & Wade, 2007; Cuban, 1993; Cuban, Kirkpatrick, 
& Peck, 2001). Explanations vary and include the lack of technology 
infrastructure, insufficient training and support of teachers (Meyer, 
Abrami, Wade, Aslan, & Deault, 2010), and knowledge tools that 
may lack key design principles guided by what has been learned 
from the learning and motivational sciences (Abrami, 2010; 
Abrami, Bernard, Bures, Borokhovski, & Tamim, 2011; Pintrich, 
2003; Mayer, 2001, 2008; Wozney, Venkatesh, & Abrami, 2006). 
The current development and research project attempts to overcome 
some of these challenges by designing software that is 
faithful to Zimmerman’s (2000) social-cognitive model of self-regulation 
and by providing help and support embedded in the 
software as well as face-to-face training and support, leading to 
enhanced implementation fidelity. 

The research presented here is a continuation of a line of inquiry 
that explores the impacts of one particular knowledge tool—an 
electronic portfolio (EP). EPs build on the evidence of what is 
already known about effective portfolio pedagogy and make 
working with portfolios more engaging, dynamic, and accessible 
for students, teachers, and parents. An EP is a digital container 
capable of storing visual and auditory content, including text, 
images, video, and sound. EPs may also be learning tools not only 
because they organize content but also because they are designed 
to support a variety of pedagogical processes and assessment 
purposes. Historically speaking, EPs are the knowledge age’s 
version of the artist’s portfolio for students in the sense that they 
not only summarize a student’s creative achievements but also 
illustrate the process of reaching those achievements. An artist, 
architect, engineer, or student who displays her or his portfolio of 
work allows the viewer to form a direct impression of that work 
without having to rely on the judgments of others. EPs tell a story 
both literally and figuratively by keeping a temporal and structural 
record of events. EPs can offer valuable opportunities for integrat- 
ing technology into K-12 classrooms beyond serving as multime- 
dia containers. They may serve to deepen a student’s learning 
experiences by placing the student at the center of his or her 
learning and scaffolding essential meta-cognitive skills such as 
goal setting, identifying strategies, and reflecting on one’s learning 
(Abrami & Barrett, 2005). 

According to Abrami and Barrett (2005), EPs have three broad 
purposes: process, showcase, and assessment. All three types of 
EPs can be used to display selected artifacts and would enable 
learners to develop their metacognitive skills in choosing their 
pieces, reflecting on how they meet assessment criteria and re- 
working elements on the basis of feedback. We chose to work with 
EPs designed as process portfolios that support how users learn 
through embedded structures and strategies. A process EP can be 
defined as a purposeful collection of student work that tells the 
story of a student’s effort, progress, and/or achievement in one or 
more areas (Arter & Spandel, 1992; Barrett, 2007; MacIsaac & 
Jackson, 1994). Process portfolios are personal learning manage- 
ment tools. They are meant to encourage individual improvement, 
personal growth and development, and a commitment to lifelong 
learning. The authors are especially interested in the use of EPs as 
process portfolios to support learning. 

Process EPs are gaining in popularity for multiple reasons 
(Abrami & Barrett, 2005; Barrett, 2009; Zubizarreta, 2004). They 
provide multimedia display and assessment possibilities for school 
and work contexts, allowing the use of a variety of means to 
demonstrate and develop understanding, which is especially advantageous 
for children whose competencies may be better reflected through 
these authentic tasks (Barrett, 2007). At the same time, by engag- 
ing learners, their deficiencies in core competencies may be better 
overcome. Process EPs may scaffold attempts at knowledge con- 
struction by supporting reflection, refinement, conferencing, and 
other processes of self-regulation, important skills for lifelong 
learning and learning how to learn. They are useful for cataloging 
and organizing learning materials, readily illustrating the process 
of learner development. They can also provide remote access 
encouraging anywhere, anytime learning and easier input from 
peers, parents, and teachers (Barrett, 2008). 

Process EPs are linked to students’ abilities to self-regulate their 
learning and to enhance their development of important educa- 
tional skills and abilities, especially literacy skills (Meyer et al., 
2010; Wade, Abrami, & Sclater, 2005). When students use port- 
folios, they assume more responsibility for their learning, better 
understand their strengths and limitations, and learn to set goals 
(Hillyer & Lye, 1996). 

Unfortunately, evidence to date on the impacts of EPs on learn- 
ing and achievement is sparse (Barrett, 2007; Carney, 2005; Zeich- 
ner & Wray, 2001). Some research has filled this gap by studying 
the impact of EPs on teaching and learning processes, especially 
those related to self-regulation, in late elementary classrooms. 
Abrami et al. (2008) found that K-12 teachers faced some chal- 
lenges when attempting to integrate process EPs into their teach- 
ing; hence, the levels of use throughout the school year were fairly 
low. Although the teachers had a positive view of EPs and SRL 
processes, most teachers needed to adjust their teaching strategies 
to effectively incorporate EPs into their teaching and needed time 
and support to do so. 

Meyer et al. (2010) studied high and medium EP implementing 
experimental teachers, who had been trained and supported, plus 
nonimplementing control teachers and their students. In this quasi- 
experiment, the researchers found that older elementary students 
who were in classrooms where the teacher provided regular and 
appropriate use of the EP tool, compared with control students who 
did not use the tool, showed significant improvements in their 
writing skills on a standardized literacy measure and certain meta- 
cognitive skills, such as monitoring, measured via student self- 
report. This is the first study to offer some evidence that teaching 
with EPs had positive impacts on students’ literacy achievement 
and SRL skills when the tool was used regularly and integrated 
into classroom instruction. 

Meyer, Abrami, Wade, and Scherzer (2011) collected data to 
understand how teachers used EPs in their classrooms, to what 
extent they integrated the EP into their practice, and the factors that 
influenced their use. They found that low implementers experi- 
enced significant technical obstacles and/or were reluctant to 
change their established practices, whereas high implementers 
reported feeling supported by their administration and experienced 
growth in their teaching practice as a result of the scaffolding and 
support provided by the software. 

On the basis of these findings, Meyer et al. (2011) made several 
recommendations to increase the quality of EP use. First, it is 
essential to ensure that the classrooms and schools have sufficient 
technical infrastructure to support the innovation. Second, teachers 
and students must have consistent access to functioning computers 
that are regularly maintained. Third, teachers must feel that there 
is positive support from the administration to invest the time in 
learning to teach with EPs. Otherwise, teacher training needs to 
focus more on why using educational technology such as EPs is 
important and appropriate rather than only how to use EPs. Most 
EPs are not technically difficult tools to use, but they may be 
pedagogically challenging. Some EPs focus on student-centered 
learning, which means that teachers need to accept classroom 
practices that go beyond didactic forms of instruction. The last 
suggestion is to provide external encouragement for teachers to 
adopt an innovation and provide a culture that values experimentation, 
improvement, and evidence-based practices. 

One purpose of the current research was to determine whether 
the findings of Meyer et al. (2010, 2011) could be meaningfully 
extended. By ensuring better access to technology, enhanced sup- 
port and recognition from school administrators, and better and 
more professional development both embedded in the tool (e.g., 
just-in-time instructional video vignettes) and via training and 
follow-up support, it was hoped that the number of high-EP 
implementers would increase and that important learning out- 
comes would be evident. Dignath and Buettner (2008) and Dignath 
et al. (2008) found that researcher implementations of SRL train- 
ing have larger effects than teacher implementations. But in the 
current investigation, teacher implementations were chosen as a 
means to foster authentic classroom practices where teachers and 
their students used the software for extended periods to teach and 
learn required curricular content. It was hoped that these would be 
scalable and sustainable practices, not researcher-made demonstra- 
tions, and that the ways to promote effective and long-lasting 
change in real classrooms would be better understood. 

Abrami (2010; Abrami et al., 2011) commented critically on the 
widely held belief that learners are motivated to use knowledge 
tools for learning. First, learners may not value the outcome(s) of 
learning sufficiently to increase their efforts to learn: It is not so 
important to do well. Second, learners may believe that gains in 
learning from increased effort are inefficient: It takes too much 
effort to do a little bit better. Third, learners may not want to 
become more responsible for their own learning: It is too risky 
unless the perceived chances of a positive outcome are increased. 
Fourth, learners may believe that novel approaches to learning 
increase, rather than decrease, the likelihood of poor outcomes: It is 
not of interest or too risky because they do not believe the tool will 
help them learn. Therefore, a secondary purpose of this research 
was to explore the extent of student engagement and satisfaction 
with the use of EPs. 

Zimmerman (2008) published an overview of SRL research that 
studied several online tools designed to stimulate and study vari- 
ous SRL processes in students. He identified four key trends that 
remain as important questions, including the relationship between 
student reports of SRL and actual use of SRL processes, the 
relationship between levels of SRL and overall academic achieve- 
ment, the role of the social context of the classroom in stimulating 
or hindering the development of SRL skills, and finally the relationship 
between motivation and SRL processes. The research 
presented in this article addresses two of these four areas: the 
relationship between levels of SRL and overall academic achieve- 
ment and the role of the social context of the classroom in stim- 
ulating or hindering the development of SRL skills. 


Research Context and Method 


About ePEARL 


Zimmerman and Tsikalas’ (2005) review of computer-based 
learning environments (CBLEs) designed to support SRL provides 
a framework for the development of a tool to support the three 
cyclical phases of SRL: forethought, performance, and self-reflection. 
The lessons learned from other partially SRL-supportive 
CBLEs have enabled us to plan for effective SRL-supportive 
design of an Electronic Portfolio Encouraging Active 
and Reflective Learning (ePEARL). 

The Centre for the Study of Learning and Performance (Con- 
cordia University, Montreal, Canada), in collaboration with our 
partner LEARN (Montreal, Canada), developed ePEARL as a 
bilingual (English and French), web-based, student-centered EP 
software tool, which is designed to support the phases of self- 
regulation (Zimmerman, 2000; Zimmerman & Tsikalas, 2005). 


The latest version of ePEARL may be explored by visiting 
http://grover.concordia.ca/epearl/promo/en/index.php. The slightly 
older version (3.0) used in this research is archived on our university 
server. The software is available at no cost to educators. 

EP tasks involved in the forethought phase are setting outcome 
goals, setting process goals, documenting goal values, planning 
strategies, and creating learning logs. Tasks involved in the performance 
phase are creating work, self-examination through recordings 
and drafts of work, and learning log entries. Tasks involved 
in the self-reflection phase are reflecting on work, process, 
and feedback received and becoming aware of new goal opportunities. 

Developed in Hypertext Preprocessor (PHP, Version 5.X) using a 
MySQL database, four levels of ePEARL were designed for use by 
children, teachers, and adult learners in early elementary, late 
elementary, secondary, and postsecondary schools and institutions. 
The forethought or planning phase includes the following features: 
describing the task, setting outcome and process goals, identifying 
strategies to achieve those goals, and providing a place for teachers 
to provide scaffolding and feedback. Figure 1 displays the planning 
phase for a science project for which the students worked in 
teams to build an “insulating machine” that would keep water hot. 
Students enter information under “Task Description,” “Criteria,” 
“Task Goals,” and “Strategies”; these processes are often carefully 
scaffolded and modeled by the teachers. The teacher completes the 
“teacher color codes” and the yellow feedback box at the bottom. 

Figure 1. Insulating Machine project: Planning. 

Features available in the performance or doing phase include 
creating new work; linking to existing work; attaching digital files 
to document the work completed; and space for teacher feedback 
on the content. Figure 2 displays the content and the teacher’s 
feedback for the insulating machine assignment. Students com- 
posed the text seen in Figure 2 and uploaded the image, after which 
the teacher provided feedback. 

Features available in the reflecting phase include reflecting on 
work; sharing work; obtaining feedback from teachers, peers, and 
parents; editing work; saving work under multiple versions; and 
sending work to a presentation folder. ePEARL promotes the 
creation of general learning goals for a term or year, or for a 
specific work/artifact; reflection; and peer, parent, and teacher 
feedback on the entire portfolio or on a specific artifact. Students 
and teachers can monitor individual progress toward completing 
each SRL component by viewing the ePEARL index page. Figure 
3 presents a sample index page from a student portfolio. This index 
page displays the title of each artifact, the date modified, the 
number of file attachments (A), if goals have been set (G), if 
reflections have been completed (R), and if comments have been 
provided (C) by teachers (T) or students (S). The remaining 
columns (BAL, CCC, SA, CK) link to specific learning outcomes 
described by local government education policy. 
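
The index-page columns described above can be pictured as one record per artifact. The following is a minimal sketch in Python, offered only as an illustration: ePEARL itself is described as a PHP and MySQL application, and the class, field, and function names here are hypothetical. In particular, treating a file attachment as evidence of the performance phase is a simplifying assumption, not a rule taken from the software.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArtifactRow:
    """One row of a hypothetical artifacts index page."""
    title: str
    date_modified: date
    attachments: int = 0           # column A: number of file attachments
    goals_set: bool = False        # column G: goals have been set
    reflection_done: bool = False  # column R: reflection completed
    teacher_comments: int = 0      # column C, comments by teachers (T)
    student_comments: int = 0      # column C, comments by students (S)

    def srl_progress(self) -> dict:
        """Map each SRL phase to whether this row shows evidence of it.

        Assumption: an attachment stands in for completed work.
        """
        return {
            "forethought": self.goals_set,
            "performance": self.attachments > 0,
            "self_reflection": self.reflection_done,
        }


def incomplete_phases(row: ArtifactRow) -> list:
    """List the SRL phases for which the row shows no evidence yet."""
    return [phase for phase, done in row.srl_progress().items() if not done]


# Illustrative row only; the values are not taken from the study's data.
row = ArtifactRow("Insulating machine", date(2009, 1, 22),
                  attachments=1, goals_set=True)
remaining = incomplete_phases(row)
```

Under these assumptions, a student or teacher scanning such a row would see that goals and work exist but that a reflection is still outstanding.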


ePEARL guides students through the creation process, allowing 
enough flexibility for truly creative work and just enough scaf- 
folding to keep students on the right track. It offers a text editor 
and an audio recorder for the creation of work. Readings, music 
pieces, or oral presentations may be recorded. The software also 
offers the ability to attach work completed using other software, so 
it can accommodate any kind of digital work a student creates, 
including videos, slideshows, podcasts, scanned images, or pho- 
tographs of paper-based work. See Figure 2. 

Before work is created, students are encouraged to set goals for 
their work, and may attach learning logs, evaluation rubrics, and 
study plans to keep track of their learning process as it takes place. 
After the creation of work, sharing with peers or teachers is 
supported so that students may solicit feedback on drafts of work. 
Figure 4 shows the sharing screen, which allows students to select 
individual classmates or entire classes to whom they would like to give 
permission to view and comment on their work. This image also shows 
the graphic of the SRL process that is present 
through the planning stages to help students remember each step of 
the process. Scaffolding for these processes is embedded in the 
portfolio, as seen in the text box that describes the purpose of 
“sharing” an artifact in order to get feedback. Students may also 
reflect on their performance and strategies and use these reflec- 
tions to adjust their goals for the next work. 


Figure 2. Insulating Machine project: Content and teacher feedback. 
Figure 3. Artifacts index page. 

The presentations folder is where students collect their selected 
artifacts, created either from within ePEARL’s text editor or audio 
recorder or from outside the tool. The selection process allows 
students to reflect on why a work belongs in their portfolio, on its 
relationship to other work, and on their own advancements. Self-regulation 
is also supported when students create new goals for 
future work or modify learning behavior on the basis of reflections 
linked to work they have recently created. Sharing with peers and 
parents is encouraged, and teachers have automatic viewing access 
to their students’ portfolios. Figures 5 and 6 display an artifact for 
a reader’s theater and the feedback shared by peers and teachers on 
this presentation. In Figure 5, the student described the “Task” and 
“Criteria” after a class discussion and some modeling of ideas for 
these portions. Then, the student established and entered his or her 
“Task Goals” and “Strategies.” In Figure 6, the student composed 
a reflection at the end of the project; the teacher and the student’s 
peers could view the final artifact and enter their feedback. 

Figure 4. Reflection screen: Sharing. 

Figure 5. Readers theater: Planning. 

Figure 6. Readers theater: Reflecting and feedback. 

In addition, there are both prose and extensive multimedia 
support materials for teachers and students to develop a better 
understanding of the what, why, and how of the self-regulation 
processes supported by ePEARL. The research team created a wiki 
and a virtual tutorial to help support teachers’ implementation of 
the SRL features within ePEARL. A series of six “jump start” 
lessons were provided to members of the wiki to help teachers 
introduce ePEARL and various SRL processes such as setting 
general goals, organizing your work, setting task goals, and 
reflecting and providing feedback. An overview of this jump start 
program is provided in Appendix A. The virtual tutorial contains 
a series of 2-min videos showcasing the software and providing 
pedagogical supports and examples for classroom teachers. This 
tutorial can be accessed by visiting 
http://grover.concordia.ca/epearl/tutorial/index.php. Additionally, 
supports were embedded within the software through help buttons 
that both students and teachers could access that provided 
definitions of SRL terminology, sample responses, and hyperlinks 
to the virtual tutorial. The professional development and 
just-in-time materials support the demonstration and modeling of 
student-centered skills and instruction, explanations of those 
skills, and elaboration of skills through additional support 
materials. In fact, the ePEARL project team went to some length to 
ensure that both students and teachers were knowledgeable about 
the features of ePEARL and the processes of SRL. This effort was 
reflected not only in the embedded multimedia and prose support 
that are part of the tool but also in the training and follow-up 
support the team provided throughout the project. 

Participants in this study were 21 teachers from elementary 
schools (Grades 4-6) and their students (N = 319) from 9 
urban and rural English school boards in Quebec and Alberta, 
Canada, who participated during the 2008-2009 school year. All 
experimental teachers (n = …) received training on the use of 
ePEARL from research center staff and follow-up support, 
including lesson plans and job aids, an online discussion forum 
(in the form of a moderated wiki), as well as in-class 
observations and model lessons during the school year. In 
addition, multimedia scaffolding and support for teachers and 
students are embedded in the tool. 

School principals and school board administrators were consulted 
to identify control teachers and their classrooms that would 
match as closely as possible the experimental teachers and their 
classrooms. All teachers needed to follow provincial curriculum 
requirements for the development of language arts skills. 
Experimental teachers did so with the aid of the software, whereas 
control teachers did not. All teachers were at liberty to decide on 
the type of language arts instruction to provide. There were no 
special language arts materials provided to either experimental or 
control teachers by the research team. Given there was little 
incentive to participate, control teachers were offered a stipend that 
could be used toward the purchase of classroom resources or 
professional development. Informed consent was obtained from 
students’ parents following Canada’s Tri-Council Policy on the 
ethical treatment of research participants. 

The study used multivariate and univariate analyses of covari- 
ance to identify differences between the experimental and control 
groups on measures of reading, writing, and self-regulation. 
Teacher and student questionnaire data on self-regulation were 
collected in September and October of 2008. Teacher and student 
questionnaire data were collected again in May and June of 2009 
after the software was used for some part of the school year, 
ranging from 6 to 8 months. In addition to questionnaires, all 
students completed the constructed response subtest of the Cana- 
dian Achievement Test, fourth edition (CAT-4; Canadian Test 
Centre, 2008) in both the fall and spring to assess their reading and 
writing skills. 


Instrumentation 


The CAT-4 assesses both response to text (ideas, support) and 
writing (content, content management) using a rubric. The results 
of the CAT-4 constructed response reading and writing activities 
were sent to the Canadian Test Centre for evaluation and as part of 
their norming study. They assigned final scores to all the students 
that were then mailed back for inclusion in the data set. The 
constructed response subtest depends on student narrative re- 
sponses to prompts as opposed to the multiple-choice format of the 
main tests of the CAT-4, which also measures student literacy. 
Multiple story prompts were used in each class at both pretest and 
posttest, but no student responded to the same prompt twice. This 
form of measuring literacy achievement was used because it was 
compatible with notions of authentic assessment, even though it 
meant we generated a less detailed analysis of student learning 
than using the closed-ended version of the CAT-4. The reliability 
coefficients (Kuder-Richardson 20) for CAT-4 subtests range be- 
tween 0.85 and 0.95, depending on the level and subtest. In the 
previous version (CAT-3), test validity was established by show- 
ing that grade levels that were known to have different levels of 
achievement did indeed have different mean scores on the same 
test. 

Abrami and Aslan (2007) developed the Student Learning Strat- 
egies Questionnaire (SLSQ; see also Abrami et al., 2008, and 
Appendix B) used in the current and past studies to measure 
students’ perceptions of their use of SRL strategies, including their 
ability to set learning goals, observe and correct their performance, 
and reflect on the learning outcome. The SLSQ contains six scales, 
namely, Goal Setting, Strategy Planning, Self-Observation, Self- 
Instruction, Feedback from Adults, and Self-Evaluation. In addi- 
tion, students were asked to complete the Academic Self- 
Regulation Questionnaire (SRQ; Ryan & Connell, 1989). This 
32-item instrument measures four subscales of self-regulation, 
namely, External, Identified, Intrinsic, and Introjected. The com- 
posite score of the SRQ, known as the Relative Autonomy Index, 
is calculated from the four subscales as follows: 2 × intrinsic +
identified − introjected − 2 × external.
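
The composite described above can be sketched in a few lines of Python. This is a minimal illustration of the stated weighting only; the function name and the example subscale scores are hypothetical and are not taken from the study data.

```python
def relative_autonomy_index(intrinsic, identified, introjected, external):
    """Relative Autonomy Index as stated in the text:
    2 x intrinsic + identified - introjected - 2 x external."""
    return 2 * intrinsic + identified - introjected - 2 * external


if __name__ == "__main__":
    # Hypothetical subscale scores, for illustration only (not study data).
    rai = relative_autonomy_index(intrinsic=18, identified=22,
                                  introjected=27, external=28)
    print(rai)  # 2*18 + 22 - 27 - 2*28 = -25
```

Because the External and Introjected subscales enter with negative weights, a learner with high scores on those subscales obtains a negative composite, as in the hypothetical example.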

At the end of the SLSQ, experimental students were also asked a 
series of open-ended questions about their experiences with ePEARL. 
These questions included items such as, “I like using ePEARL in my
class because . . .” and “I did not like using ePEARL in my class
because . . .” as well as “What I liked most about using ePEARL is . . .”
and “What I liked least about ePEARL is . . .” Responses were coded
to measure student enthusiasm as 1 = Low, 2 = Medium, 3 = High. 
A sample high-enthusiasm statement was, “It was a really fun 
thing to go on. I liked it so much.” A sample medium- 
enthusiasm statement was, “It was an OK program to use to 
help us learn. I’m not a huge fan.” A sample low-enthusiasm 
statement was, “I didn’t like anything. It was annoying, and I’m 
so glad I don’t have to go back on it.” Enthusiasm statements 
were coded by two independent raters, and the four instances of 
disagreements were resolved through discussion. 

At two points during the year (April and June), teachers com- 
pleted an Implementation Fidelity Questionnaire (IFQ; Meyer et 
al., 2010, 2011) that asked them to report on how many hours a 
month they had been using ePEARL as well as describe what was
going well and what challenges they were facing. In order to
measure implementation fidelity, Meyer et al.’s (2010, 2011) Im- 
plementation Assessment Protocol (IAP v2) was slightly revised. 
This protocol assessed the data reported on the IFQ and a sample 
of student portfolios in each classroom to determine the following: 
average number of artifacts, date range of use, and the degree to 
which students were using all of the available features of the 
software such as goal setting, attaching artifacts, feedback, and 
reflection. The IAP v2 allows each experimental classroom to be 
assigned a degree of implementation: low, medium, or high. For 
example, low-implementation classrooms would be those that re- 
ported less than 4 hr of ePEARL use each month; the student 
portfolios had zero to three student artifacts; and the artifacts 
would often be incomplete (i.e., work would be stored, but few of
the SRL features would be used). Medium-high implementation
classrooms would be those that would report using ePEARL 5 or
more hr each month; the student portfolio would have at least four 
artifacts; and the artifacts would include goals, content, and a 
reflection. Table 1 provides an overview of the IAP criteria for 
ePEARL implementation. 

A research team member visited classrooms one to three times 
during the school year. These classroom visits served both data 
collection and implementation support purposes. The research 
team member presented a mini lesson on a feature of SRL and 
provided any necessary technological training and support. The 
content of these visits was documented in field notes that tracked 
how ePEARL was being used by the teachers and the students. At 
the end of the school year, teacher exit interviews were conducted 
in experimental classes using the Teacher Exit Interview Protocol 
(Meyer et al., 2010, 2011). This semistructured interview protocol 
was designed to explore the reasons for teachers’ varying degrees 
and types of implementation, including their expectations, access 
to technology, support from administration and tech personnel, 
familiarity with portfolio pedagogy, knowledge of SRL processes, 
and time management issues. 


Analyses 


Two raters analyzed a random sample of each classroom’s 
portfolios and independently assigned each classroom a rating of 
low, medium, or high implementation on the basis of the criteria 
outlined in the IAP v2. Raters achieved 90% agreement and 
together determined the IAP scores for each of the experimental 
teachers. All experimental classrooms were identified as medium- 
or high-implementation classrooms. 
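
The interrater figure reported above can be illustrated with a short sketch. It assumes simple percent agreement (the text does not name the agreement index that was used); the function name and the classroom ratings shown are hypothetical.

```python
def percent_agreement(ratings_a, ratings_b):
    """Share of cases (in percent) on which two raters assign the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both raters must rate the same, nonempty set of cases.")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


if __name__ == "__main__":
    # Hypothetical implementation ratings for 10 classrooms (not study data).
    rater_1 = ["medium", "high", "high", "medium", "medium",
               "high", "medium", "high", "medium", "high"]
    rater_2 = ["medium", "high", "high", "medium", "high",
               "high", "medium", "high", "medium", "high"]
    print(percent_agreement(rater_1, rater_2))  # 9 of 10 ratings match: 90.0
```

Percent agreement does not correct for chance agreement; an index such as Cohen's kappa would be a stricter alternative.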

Cronbach’s alpha, which measures internal consistency, was 
found to be .86 for both the SLSQ total pretest and posttest scores; 
alpha for the SRQ pretest was .91, and for the SRQ posttest, alpha 
was .88. Reliability for the six subscales of the SLSQ ranged from 
.81 to .88, whereas those of the four subscales for the SRQ ranged 
from .86 to .91. All effect size (ES) calculations follow Cohen’s f
(Cohen, 1988) for F ratios produced in analyses of variance,
$f = \sqrt{\eta^2 / (1 - \eta^2)}$. According to Cohen (1988), an f value of .10 can be


considered as a small effect, .25 would be a medium effect, and .40 
a large effect; however, these labels must be interpreted in the 
context of the research study; for example, a medium ES for a 
quasi-experimental study might be interpreted as more valuable 
than a medium ES for an experimental study. We report ESs for 
both multivariate as well as univariate analyses because the former 
may be misleading in that it is adjusted for covariance among 
outcome variables. 
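
As an illustration of the effect size computation described above, the following minimal sketch derives Cohen's f from an F ratio and its degrees of freedom. It assumes that η² is the partial eta squared recovered from F as F × df_effect / (F × df_effect + df_error); the function names are hypothetical, and ES values produced by statistical software may differ from this hand calculation.

```python
import math


def eta_squared_from_f(f_ratio, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    explained = f_ratio * df_effect
    return explained / (explained + df_error)


def cohens_f(f_ratio, df_effect, df_error):
    """Cohen's f = sqrt(eta^2 / (1 - eta^2))."""
    eta_sq = eta_squared_from_f(f_ratio, df_effect, df_error)
    return math.sqrt(eta_sq / (1.0 - eta_sq))


def label_effect(f_value):
    """Cohen's (1988) benchmarks: .10 small, .25 medium, .40 large."""
    if f_value >= 0.40:
        return "large"
    if f_value >= 0.25:
        return "medium"
    if f_value >= 0.10:
        return "small"
    return "below small"


if __name__ == "__main__":
    f_value = cohens_f(f_ratio=38.70, df_effect=9, df_error=588)
    print(round(f_value, 2), label_effect(f_value))
```

Under this assumption the expression simplifies to sqrt(F × df_effect / df_error), so the effect size can be checked directly from any reported F ratio and its degrees of freedom. The benchmark labels remain subject to the contextual interpretation noted above.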

We used both design and statistical adjustments while inspect- 
ing and screening our data. From the initial experimental sample of 
206 students, we decided to hold aside 39 (18.93%) students from 
the experimental group on the basis of their low enthusiasm for 
using the software because these students’ achievement and self- 
regulation scores were significantly lower than those with either 
medium (n = 84 or 40.78%) or high enthusiasm (n = 83 or 
40.29%). These low-enthusiasm students merit a separate analysis, 
which is detailed below, but are excluded from the main analyses 
as they were resistant to participating in the study and hesitant 
about using EPs for learning. Multivariate analyses using the
pretest scores for CAT-4, the SLSQ, and SRQ as covariates; the
posttest scores for the CAT-4, SLSQ, and SRQ as dependent
variables; and the three levels of enthusiasm yielded a significant
value of Pillai’s trace of 1.116, F(9, 588) = 38.70, p < .001, ES =
.76. Subsequent univariate tests showed that low-enthusiasm stu-
dents (M = 47.85, SD = 10.59) scored significantly lower than
their counterparts in the medium- (M = 55.90, SD = 8.90) and
high-enthusiasm groups (M = 57.40, SD = 10.57) on their SLSQ
posttest scores, F(2, 196) = 3.77, p < .05, ES = .18. Similarly,
students in the low-enthusiasm group (M = 7.76, SD = 2.41)
scored significantly lower than those in the medium-enthusiasm
group (M = 8.53, SD = 2.22) on the CAT-4 achievement posttest
score, F(2, 200) = 3.77, p < .05, ES = .18.

An additional reduction in sample size of 46 students occurred
after we conducted a missing value analysis using SPSS software
and excluded those who did not respond to items in either the
CAT-4 achievement tests or the SRQ and SLSQ instruments.
Analyses using SPSS version 19 revealed the values were missing
at random, and hence, although removing these data might affect
the power of the ensuing analyses, any parameter estimates would
be unbiased. Finally, 79 control group participants were eliminated
from the sample to establish pretest equivalence between the
groups. After removal, results of a between-groups multivariate
analysis of variance using pretest CAT-4, SRQ, and SLSQ scores
as dependent measures revealed a nonsignificant Pillai’s trace of
.002 (p = .90). This yielded a final sample of 319 (n for experi-
mental group = 154, n for control group = 165) with no missing
data for any variables.


Table 1
Implementation Assessment Protocol

Criterion            Low                              Medium                     High
IFQ hr/month         < 4                              5-12                       [illegible]
Avg. # artifacts     ≤ 3                              4-6                        7+
Date range of use    Entries span less than 60 days   Entries span 61-120 days   Entries span 121 days or more

Uses of EPs

Planning: Goals & Strategies
  Low
    • 1 or no General Goals
    • 1 or no Task Goals
    • 1 or no Strategies
  Medium
    • At least 2 General Goals
    • At least 3 artifacts have goals/strategies, content, & reflection
    • Goals & strategies may be vague, inappropriate, or may be attached to a grade/mark
  High
    • 3 or more General Goals (some may have been revised)
    • 4 or more artifacts have goals & strategies
    • Goals & strategies are clearly defined and appropriate to the task

Doing: Content
  Low
    • Storage only
    • Incomplete entries
  Medium
    • Artifacts may be in only one subject area
    • At least 3 artifacts have content
    • Content is missing in some artifacts
  High
    • Creative use of EP (different attachments, well-developed home page)
    • 4 or more artifacts have content included (attachments, text editor, audio files)
    • Artifacts included from multiple subject areas
    • Multiple versions of artifacts

Reflecting
  Low
    • 1 or no reflections
  Medium
    • At least 3 artifacts have reflections
    • Reflections are brief and generally vague (“I liked it, I had fun”)
  High
    • 4 or more artifacts have reflections
    • Reflections show deep thought about learning process and/or addresses goals & strategies

Feedback
  Low
    • No feedback
  Medium
    • Teacher feedback in fewer than 3 artifacts
    • Feedback is summative only (gives a score/directive)
    • Feedback from peers has a lot of “chat”
  High
    • Teacher feedback on 4 or more artifacts
    • Feedback offers suggestions, asks questions, stimulates reflection
    • Feedback from peers includes constructive suggestions

Presentation folder
  Low
    • Empty
  Medium
    • 1-2 items
    • No reflection/selection reasons given or icons only selected
  High
    • At least 3 items with reflection/selection explanation included

Note. IFQ = Implementation Fidelity Questionnaire; Avg. = average; EP = electronic portfolio.

All survey data were entered into SPSS version 19 by two 
graduate research assistants and verified for accuracy. For all 
measures, analyses were run using a multivariate analysis of co- 
variance (MANCOVA) design, with the treatment (experimental, 
control) as the independent variable, pretest scores as covariates, 
and posttest scores as dependent variables. For all analyses, results 
of evaluation of assumptions of normality, homogeneity of 
variance-covariance matrices, linearity, and multicollinearity were
satisfactory. Customized models were tested in the MANCOVA to 
ensure that the homogeneity of regression slopes assumption was 
met. There were no univariate or multivariate within-cell outliers 
at α = .001. Questionnaire data were analyzed by item and
aggregated to their respective scales, to obtain a fine-grained 
analysis of specific changes in self-regulation that occurred as a 
result of ePEARL use. These quantitative data, combined with an 
analysis of carefully selected student portfolios and classroom 
observations, provide a rich picture of how the use of ePEARL 
supports the academic achievement and self-regulation of learners 
who participated in the study. 
Results


The first MANCOVA was conducted with the three posttest
composite scores of the CAT-4, SLSQ, and SRQ treated as
dependent variables, the EP-control treatment as the fixed factor,
and the three pretest composite scores of the CAT-4, SLSQ, and
SRQ as covariates. The multivariate test for main effect of the
treatment on the dependent posttest measures was significant (Pillai’s
trace = .24), F(6, 631) = 85.58, p < .001, ES = .35.
Follow-up tests showed that the experimental group significantly 
outperformed the control group on the SRQ posttest measures, 
F(1, 314) = 10.09, p < .01, ES = .27, after accounting for 
differences in pretest scores. See Table 2 for descriptive statistics, 
covariate values, and adjusted means for the pretest and posttest 
CAT-4, SRQ, and SLSQ composite measures. In addition, all the 
reported analyses were unchanged when the Huber-White correc- 
tion (Huber, 1967; White, 1982) was applied to control for error 
due to nonnormal distributions of residuals in the dependent vari- 
ables, with teacher, school, and province entered as clusters in the 
model. 

A second MANCOVA was conducted with the CAT-4 posttest 
subscales of ideas presented in students’ response to text, support 
for response to text, content presented in the writing assignment, as 
well as content management in writing skills as dependent vari- 
ables, treatment versus control groups as fixed factors, and the 
CAT-4 pretest subscales as covariates. Neither the SLSQ nor the 
SRQ measures were included in these analyses. The multivariate 
test for the main effect of treatment on the CAT-4 posttest sub- 
scales after having accounted for differences in pretest subscale 
scores was significant (Pillai’s trace = .144), F(8, 624) = 6.33, 
p < .001, ES = .27. Post hoc tests revealed that students using 
ePEARL had significantly greater posttest scores compared with 
controls in the subscales of providing support for response to text, 
F(1, 313) = 11.79, p < .01, ES = .27; content presented in the
writing assignment, F(1, 313) = 9.78, p < .01, ES = .25; and
content management in writing skills, F(1, 313) = 12.00, p < .01,
ES = .27. See Table 3 for detailed descriptive statistics, covariate
values, and adjusted means for the CAT-4 subscale responses.


Table 2
Descriptive Statistics and Adjusted Means From MANCOVA
With Achievement and Self-Regulation Composite Measures for
Experimental and Control Groups

                          Experimental group           Control group
Composite score means     Preᵃ      Postᵇ              Preᵃ          Postᵇ
CAT-4                     6.79      8.19 (8.09)        6.79          7.48 (7.37)
  SD                      2.06      2.28               1.46          1.96
SRQᶜ                      −22.89    −17.94 (−20.59)    [illegible]   −29.83 (−25.45)
  SD                      17.23     17.24              18.19         19.00
SLSQ                      56.78     57.80 (57.45)      56.01         55.46 (55.83)
  SD                      10.34     [illegible]        10.64         10.63

Note. n = 154 (experimental), 165 (control). MANCOVA = multivariate
analysis of covariance; CAT-4 = Canadian Achievement Test, fourth
edition; SRQ = Academic Self-Regulation Questionnaire; SLSQ = Stu-
dent Learning Strategies Questionnaire.
ᵃ Pretest covariates in the MANCOVA were set at the sample
mean. ᵇ Adjusted means for posttest scores calculated in MANCOVA
appear in parentheses. ᶜ The SRQ Relative Autonomy Index = 2 ×
intrinsic + identified − introjected − 2 × extrinsic; this means that
Intrinsic and Identified subscales contribute positively, whereas the Intro-
jected and Extrinsic subscales reduce the self-regulation score in learners.
The less negative the composite SRQ score, the better the overall relative
autonomy index of the learner.


Table 3
Descriptive Statistics and Adjusted Means From MANCOVA
With CAT-4 Subscale Measures for Experimental and Control
Groups

                               Experimental group         Control group
CAT-4 subscale means           Preᵃ          Postᵇ        Preᵃ          Postᵇ
Response to Text: Ideas        1.71          1.99 (1.98)  1.73          2.04 (2.03)
  SD                           .61           .66          [illegible]   .62
Response to Text: Support      1.71          2.10 (2.01)  1.68          1.83 (1.86)
  SD                           .68           [illegible]  .59           .74
Writing: Content               [illegible]   [illegible]  [illegible]   [illegible]
  SD                           .67           .70          .62           .63
Writing: Content Management    1.64          2.00 (1.95)  1.63          1.73 (1.71)
  SD                           .67           .69          [illegible]   .64

Note. n = 154 (experimental), 165 (control). MANCOVA = multivariate
analysis of covariance; CAT-4 = Canadian Achievement Test, fourth
edition.
ᵃ Pretest covariates in the MANCOVA were set at the sample
mean. ᵇ Adjusted means for posttest scores calculated in MANCOVA
appear in parentheses.

A third MANCOVA was conducted with the SLSQ posttest 
subscales as dependent variables, experimental versus control 
group as the fixed factor, and SLSQ pretest subscales as covariates, 
and these yielded a significant Pillai’s trace of .20, F(12, 612) = 
33.44, p < .001, ES = .33. Follow-up tests revealed that partici- 
pants using ePEARL showed significantly higher posttest scores 
than the controls on five of six subscales of the SLSQ: Goal 
Setting, F(1, 311) = 11.58, p < .01, ES = .20; Strategy Planning,
F(1, 311) = 10.55, p < .01, ES = .18; Self-Observation,
F(1, 311) = 6.33, p < .01, ES = .15; Self-Instruction, F(1, 311) =
5.33, p < .01, ES = .14; and Self-Evaluation, F(1, 311) = 5.32,
p < .01, ES = .14. See Table 4 for the descriptive statistics,
covariate values, and adjusted means for the SLSQ subscales.


Table 4
Descriptive Statistics and Adjusted Means From MANCOVA
With SLSQ Subscale Measures for Experimental and Control
Groups

                          Experimental group            Control group
SLSQ subscale means       Preᵃ          Postᵇ           Preᵃ          Postᵇ
Goal Setting              7.04          7.60 (7.49)     6.79          6.83 (6.75)
  SD                      1.63          1.50            [illegible]   1.75
Strategy Planning         [illegible]   8.10 (7.95)     8.01          [illegible]
  SD                      1.59          1.46            1.34          [illegible]
Self-Observation          14.77         14.99 (14.82)   14.90         14.48 (14.35)
  SD                      [illegible]   [illegible]     [illegible]   2.82
Self-Instruction          11.11         10.81 (10.74)   10.86         10.41 (10.26)
  SD                      [illegible]   2.00            2.36          2.50
Feedback from Adults      8.35          8.33 (8.34)     8.50          8.43 (8.41)
  SD                      1.64          [illegible]     1.41          1.50
Self-Evaluation           7.56          7.97 (7.79)     [illegible]   7.54 (7.60)
  SD                      1.50          1.45            1.63          1.56

Note. n = 154 (experimental), 165 (control). MANCOVA = multivariate
analysis of covariance; SLSQ = Student Learning Strategies Question-
naire.
ᵃ Pretest covariates in the MANCOVA were set at the sample
mean. ᵇ Adjusted means for posttest scores calculated in MANCOVA
appear in parentheses.


Table 5
Descriptive Statistics and Adjusted Means From MANCOVA
With SRQ Subscale Measures for Experimental and Control
Groups

                          Experimental group            Control group
SRQ subscale meansᶜ       Preᵃ          Postᵇ           Preᵃ          Postᵇ
Extrinsic                 27.70         24.84 (25.75)   27.31         26.95 (26.77)
  SD                      [illegible]   [illegible]     5.10          [illegible]
Introjected               27.41         24.62 (24.59)   26.30         27.00 (26.01)
  SD                      6.11          5.45            [illegible]   [illegible]
Identified                [illegible]   [illegible]     [illegible]   [illegible]
  SD                      [illegible]   3.74            4.22          4.05
Intrinsic                 18.31         20.04 (18.20)   17.59         15.45 (15.46)
  SD                      [illegible]   3.23            5.46          [illegible]

Note. n = 154 (experimental), 165 (control). MANCOVA = multivariate
analysis of covariance; SRQ = Academic Self-Regulation Questionnaire.
ᵃ Pretest covariates in the MANCOVA were set at the sample
mean. ᵇ Adjusted means for posttest scores calculated in MANCOVA
appear in parentheses. ᶜ The SRQ Relative Autonomy Index = 2 ×
intrinsic + identified − introjected − 2 × extrinsic; this means that
Intrinsic and Identified subscales contribute positively, whereas the Intro-
jected and Extrinsic subscales reduce the self-regulation score in learners.
The lesser the extrinsic and introjected scores, the more autonomy the
learner possesses.

A fourth MANCOVA of the SRQ subscales as dependent mea- 
sures, experimental versus control group as the fixed factor, and the 
SRQ pretest subscales as covariates yielded a significant Pillai’s trace 
of .20, F(8, 624) = 8.54, p < .001, ES = .32. Post hoc tests showed 
that students who used the EP had significantly better posttest 
scores than the controls for all four SRQ subscales, namely, 
Extrinsic, F(1, 313) = 11.33, p < .01, ES = .26; Introjected, 
F(1, 313) = 11.66, p < .01, ES = .26; Identified, F(1, 313) = 
13.10, p < .01, ES = .28; and Intrinsic, F(1, 313) = 12.54, p < 
.01, ES = .27. See Table 5 for the descriptive statistics, covariate 
values, and adjusted means for these SRQ subscales. 


Illustration of ePEARL Use 


A review of the student work stored in ePEARL using the IAP 
v2 demonstrated that the teachers used ePEARL to help students 
develop SRL skills as well as literacy, information and communi- 
cation technology (ICT), and other content area skills. Some 
sample projects included a reader’s theatre, writing a fable, de- 
signing simple machines, building an insulating machine, and 
constructing a model of a First Nations village. For each artifact, 
students were prompted to describe the task, set task goals, identify 
strategies to accomplish these goals, as well as reflect on their 
progress toward these goals. The following excerpt is a sample of 
one student’s artifact titled “Simple machines.” 


Task Description 


We will use the simple machines of pulleys, rollers, levers, wheels and 
axles, and incline plane. We will build the model using lots of 
different materials in a small cardboard box. 
Criteria

1: I will use all 5 simple machines in the playground. 

2: I will be creative and use the simple machines to create new ideas.
3: My inventions will be detailed and they will work. 

4: The playground will be sturdy and neatly assembled. 


5: I will be able to explain how each simple machine works in my 
playground. 


Goals 
Task Goals Updated 01/23/09 


I will use materials that actually work the way the simple machine 
should. 


Strategies Updated 01/23/09. 

1: I will use my notes as a resource for reminders. 

2: I will use my blueprint as a guide. 

3: I will share ideas with my partner. 

4: I will test materials before I put them in my playground. 


5: I will test all my machines before I share my playground with the 
class. 


Content 
[images of simple machines] 
Reflections Updated 02/10/09 


Me and Joseph did so well, I’ve got my goal done by due date. We 
used all 5 simple machines it was sturdy. I’m pretty sure my science 
grade will be great. Next time I need to get Joseph to stop fooling 
around and help me, it’s geting annoying. 


Teacher Feedback 


It’s hard to work with someone who isn’t putting in the same effort. 
Would you choose a different partner next time? You met your criteria 
and met your goal at the same time! Way to go! 


This text shows how the students used ePEARL to document their 
plan and monitor their progress as they worked on their project 
with a partner. The teacher feedback at the end gives evidence of 
how the teachers worked to give specific feedback to guide their 
reflections and to prompt additional learning on the basis of the 
results of this lesson. 

While visiting the classrooms, the researchers provided sample 
lessons and documented examples of how the teachers were sup- 
porting student development of SRL skills. The following excerpt 
provides a sample of how one teacher introduced the topic of 
reflection. 


[The teacher] presented a lesson on the SMART board about portfo- 
lios and the purposes of them. She had created all kinds of fun games 
in the SMART board that had students engaged and participating in 
identifying the differences between a portfolio and a random collec- 
tion of work/notes/assignments: “shows learning” “you reflect” “you 
select.” She then pulled up a few sample artifacts and showed strong 
examples of student work that had good task goals, strategies, criteria,
and reflections. She talked about the importance of reading the goals
and reflections for her as a teacher then showed the ePEARL video: 
reflecting (works in progress) and talked about how she reflects daily 
and how she asks the students to reflect on their work. 


She then showed the rubric that she had created for the “personal 
heroes project” and had sent to all the students via the portal. She 
explained to them how to attach the rubric and said that was a 
requirement for this assignment, as it was a new ICT skill they needed 
to practice. She then introduced another SMART board game to talk 
about strategies (word scramble to answer a question: “What helps us 
work to achieve our goals?”). She explained that the focus for this 
assignment was to learn to attach the rubric and to reflect on their 
progress as they worked on this project. She told them that this was a 
social studies artifact [indicating the folder they should assign it to] 
and told them to choose a partner to work with on the computer. They 
should help each other enter their plan (they had already completed 
planning sheets). When they were both finished, they were instructed 
to work on their graphic organizer for the project (paper and pencil). 
Students had about 20 minutes to work, and most got the planning 
done (2009-02-22 field notes).


This lesson demonstrated that the teacher was using the instruc- 
tional supports provided via the software (instructional videos and 
planning sheets) to provide focused instruction on particular SRL 
skills. The teacher interview data indicated that the teachers found
these visits to be very helpful and that the supports designed to accompany
the software provided them with the tools and information necessary to
provide effective instruction on the SRL process.


Discussion 


The results of this study provide important confirmatory evi- 
dence of the positive impact of the classroom use of EPs on 
students’ literacy skills and SRL strategies. Enthusiastic students 
who used ePEARL in medium- or high-implementation class- 
rooms demonstrated moderate-size learning gains on a stan- 
dardized literacy measure and reported positive changes in key 
SRL skills. Whereas other research explores the processes 
involved in the development and use of SRL, this study adds an 
important element by providing convincing evidence that a 
theoretically based knowledge tool, when wisely and well im- 
plemented by classroom teachers, can have a meaningful impact 
on learning. 

EPs are promoted as knowledge tools that are designed to 
facilitate the integration of technology in classrooms by being fully 
embedded into classroom life rather than merely added to it. In 
contrast to the longitudinal study by Meyer et al. (2010), in the 
current investigation medium or high implementation of the EP 
was achieved in all the experimental classrooms. In these class- 
rooms, the positive impact on learning and self-regulation that a 
process EP can achieve was documented, despite using an instru- 
ment—the constructed response subtest of the CAT-4 —that is not 
especially sensitive to small, subtle changes in student literacy 
skills. At the same time, it was noted that not all students were
uniformly enthusiastic about ePEARL. Fortunately, the percentage 
of low-enthusiasm students was much smaller (less than 20%) than 
the percentage of medium- and high-enthusiasm students (about 
80%), attesting to the applicability and acceptance of EPs among 
the majority of students, at least through elementary school. The 
low-enthusiasm students did not show the same academic im- 
provements or benefits from the standpoint of developing self-
regulatory competencies as their counterparts, as our statistical
analyses point out, and merit further investigation in future re- 
search, especially given that motivation is a key component of 
self-regulation. 

Prior research and the reviews of research on teaching students 
to self-regulate (Dignath & Buettner, 2008; Dignath et al., 2008) 
point to both the benefits of student self-regulation and instruc- 
tional interventions that enhance these skills in students. The 
current quasi-experiment documents the feasibility of developing a 
knowledge tool that can be used with fidelity by classroom teach- 
ers and received with enthusiasm by the majority of students. 
Teachers, and not researchers, successfully implemented an SRL 
instructional program with a significant impact on students’ SRL 
processes, leading to changes in their reading and writing skills, 
although mathematics might have been an easier subject area to 
effect change. This research provides good evidence with regard
to (a) ameliorating concerns (e.g., Barrett, 2007) about the lack of 
evidence of the impact of EPs on student learning and (b) reducing 
concerns (e.g., Zimmerman, 2008) about the lack of evidence of 
the impact of SRL knowledge tools on student learning. 


Future Directions 


Winne, Hadwin, and Gress (2010) discuss the importance of 
socially shared self-regulation and coregulation, emphasizing that 
knowledge building often occurs in collaboration with others. 
ePEARL allows students to collaborate with others, but the tool 
does not yet scaffold the strategies for joint productivity as well as 
it might. Revisions to the software to encourage use within coop- 
erative and collaborative learning environments are necessary and 
would fit seamlessly within student-centered contexts. 

Second, further research should explore the extended use of 
ePEARL by students and teachers. Also of value would be a 
follow-up investigation concerning whether learning gains and 
student SRL changes promoted by ePEARL use during one school 
year were sustained over time. 

Third, ePEARL has been linked by the authors with other 
evidence-based educational software, including an early literacy 
tool (ABRACADABRA), an inquiry and information literacy tool 
(ISIS-21), and the prototype of an early mathematics tool (ELM/ 
ORME). Ongoing research will document the challenges and com- 
plexities of using two tools simultaneously, one to scaffold SRL 
strategies and another to scaffold the learning of curricular content, 
and whether this can be done together both effectively and effi- 
ciently. 

Fourth, Idan, Abrami, Wade, and Meyer (2011) developed 
ePEARL for adult learners (i.e., senior secondary, vocational, and 
postsecondary teachers and students) that builds on the design of 
the previous three levels of ePEARL by explicitly scaffolding 
detailed aspects of the motivational, cognitive, and metacognitive 
aspects of SRL. Initial concerns regarding ePEARL Level 4 focus 
on usability, including the acceptance of a more complex and 
comprehensive tool by teachers and students. 

Lastly, ePEARL has been adapted for use by studio music 
teachers and their students, called iSCORE, and research is under- 
way on its use within arts education (Upitis, Abrami, Brook, 
Troop, & Varela, 2012). 
Cautions and Limitations


The strengths of this research include the size and geographic 
diversity of the participants; the successful integration of the tool 
as part of classroom practice in medium- and high-implementation 
classrooms; the length of the study; and the use of a standardized 
achievement measure compatible with the underlying philosophy 
of portfolios. The weaknesses of this research relate mostly to 
research design and instrumentation. We recognize that self-report
measures of SRL have been shown to be somewhat
inaccurate representations of actual SRL (Jamieson-Noel &
Winne, 2003), and in our future research, we hope to collect more 
behavioral or log-file data from students’ use of software to better 
understand SRL competencies (Shaikh et al., 2012; Venkatesh & 
Shaikh, 2008). 

A strong quasi-experimental design was used but not a true 
experimental design. Homogeneity of covariance matrices and 
homogeneity of regression assumptions for all the analyses of 
covariance were verified. Although it might be argued that the 
study’s ecological validity was affected due to the suppression of 
data from the ePEARL group students with low enthusiasm for 
using the software, our analysis demonstrates that they signifi- 
cantly underperformed both on the achievement and self- 
regulation measures as compared with the students with medium 
or high enthusiasm, and so their data were best analyzed sepa- 
rately. 


Implementation Issues in the Use of EPs 


On the basis of studying ePEARL use in classrooms both in this 
study and over several years (Abrami & Barrett, 2005; Abrami et 
al., 2006; Abrami, Wade, et al., 2008; Bures, Barclay, Abrami, &
Meyer, 2009; Meyer, Abrami, & Wade, 2009; Meyer et al., 2010, 
2011; Wade et al., 2005), some valuable lessons were learned: 

1. The use of portfolios should be a school-based or board- 
(district-) based initiative and integrated into regular classroom 
teaching. Use of the EP in one or two classrooms once or twice a 
week will have a smaller impact. 

2. The use of portfolios should begin early in students’ educa- 
tional experience and not be short lived. The processes of self- 
regulation and approaches to pedagogy that electronic portfolios 
support require time for younger students to learn and effort for 
older students to make the transition from traditional, teacher- 
directed methods. 

3. The regular and systematic use of EPs should be undertaken 
when students work on novel, complex, and challenging tasks. 
Unimportant and simple tasks do not require a knowledge tool that 
provides the degree of learner scaffolding apparent in EP software. 

4. Teachers need to develop facility with portfolio processes, 
and they should be supported with appropriate professional devel- 
opment and administrative support. 

5. EPs provide the means to scaffold teachers and students in the 
portfolio process and better encourage self-regulation, although 
these tools are not a sufficient condition for change. 

6. Students and teachers must believe that the change to using a 
process portfolio is valued and necessary for authentic, more 
meaningful learning. The “will” component of SRL is as important 
as the “skill” component. 

This last observation is perhaps the most important. This re- 
search project has operated under the belief that learners benefit 
from knowledge tools for learning and that they need to learn how
to use them in order to experience achievement gains. However, do 
learners and their teachers see the value in using these tools for 
learning? And do learners want to learn how to use them? These 
questions touch on a number of dilemmas in contemporary edu- 
cation—the challenges of creating and sustaining effective 
student-centered learning environments, the difficulties in integrat- 
ing technology in classrooms, and the obstacles to switching 
pedagogy from emphasizing what content is to be learned to 
emphasizing how content is to be learned. More particularly, 
answering these questions may help explain the failure of other 
researchers to document wide-scale and faithful implementations 
of other EPs and the inability to document the impact of tool use 
on teachers and their students. 

It is hoped that the findings of this research will encourage 
school leaders and teacher educators to recognize the value and 
importance of EPs to support SRL. This study indicates that 
students improve in their writing and certain SRL skills when an 
EP is used regularly and appropriately throughout the school year. 
In order to encourage the effective integration of EPs, or for other 
technological and pedagogical innovations to happen widely and 
well, school leaders, teacher educators, and pedagogical support 
staff need to provide consistent positive support to teachers as they 
learn to teach with new technologies and work within the changing 
realities of their school environments. 

There have been many calls to transform classrooms to become 
student centered and to use technology and knowledge tools as a 
means to promote active, engaged, and reflective learning by 
students. This research has provided some evidence that EPs in 
general, and ePEARL specifically, answer these calls. 

Appendix A





Centre for the Study of Learning and Performance
Centre d'études sur l'apprentissage et la performance


Electronic Portfolio Encouraging Active Reflective Learning


Jump Start! 
6 lessons for getting started with ePEARL (Levels 2&3) 


Purpose: 
To introduce teachers & students to ePEARL and using electronic portfolios in K-12 classrooms. 


Objectives: 
To help teachers effectively introduce the basic features of ePEARL and the self-regulated
learning process to their students. 


Time required: 6 lessons (45-60 minutes each)
Materials required: 
1) Jump Start lesson plans 
2) Internet connected computer, speakers & projector or SMART board 
3) List of student usernames and passwords 
4) Mobile lab or reserve the computer lab 


Topics addressed: 
1) Lesson 1: Introduction & Help 
a. What are electronic portfolios? 
b. How does ePEARL work? 
c. How do I log in and personalize my ePEARL? 
d. How do I get help if I’m stuck?


2) Lesson 2: General Goals & Help 
a. What are General Goals? 
b. How to set good General Goals 
c. How to input General Goals in ePEARL 
d. How to get help setting General Goals 


3) Lesson 3: Organizing your ePEARL 
a. Why do we need to get organized? 
b. Using the agenda 
c. How to organize your work into folders 
d. How to manage colour codes 


4) Lesson 4: Planning - Starting a new artifact
a. Why is it important to plan your task?
b. What do I fill in here? Understanding the terms:
i. Description
ii. Criteria & rubrics
iii. Task Goals
iv. Strategies


c. Helping students identify useful task goals and strategies 
d. How to do this in ePEARL 


5) Lesson 5: Doing 
a. Composing in the text editor 
b. Recording readings or music with the audio recorder 
c. Attaching multimedia files 
d. Reflecting on your task goals & strategies as you work 


6) Lesson 6: Reflecting on works in progress and completed works 
a. Sharing works with peers 
b. Giving constructive feedback 
c. Revising work based on feedback 
d. Saving as a new version 


Planning Tips: 


Train 4-5 tech-savvy students on ePEARL the day before you introduce each lesson so 
they can assist you with any questions or challenges the students have. 

Set up the equipment the afternoon before and practice logging in to make sure 
everything works properly, including your own username & password. 

Preview the virtual tutorial chapters and instructional videos on each topic before 
presenting the lesson to the students. These will provide you with a general overview of 
the key software features and teaching ideas that will help you to present that lesson. 

Join the WIKI — there are additional materials on this interactive forum to help with these 
lessons. You need to email a request to join this wiki and it may take 2-3 days to activate 
your account. To join write: emeyer@education.concordia.ca 


Appendix B







Centre for the Study of Learning and Performance 
Centre d'études sur l'apprentissage et la performance 


CENTRE FOR THE STUDY OF LEARNING AND PERFORMANCE 
McConnell Building, 1455 de Maisonneuve Blvd. W., LB-581 
Montreal, Quebec, Canada H3G 1M8

Tel: (514) 848-2424 x2020 


Learning Strategies Questionnaire 


This questionnaire is part of a study being conducted by the Centre for the Study of Learning and 
Performance at Concordia University in Montreal, Quebec. We would like to know more about 
how you are learning this year. This questionnaire will help us learn about the strategies you are 
using in your class to help you with your work. 


Please answer the questions on the next page. There is no right or wrong answer. Your 
answers are confidential (no one that you know will be told what you answered). Your teacher 
will not have access to your answers. You have the right to refuse to participate or to withdraw
(stop answering the questions) at any time. However, your experiences and opinions are 
important, and will help us understand teaching from your point of view. 


Thank you for your collaboration! 


Vanitha Pillay, Research Coordinator, CSLP 
Phil Abrami, Professor and Director, CSLP 

PERSONAL INFORMATION

• Name: ____________________
• Gender: Boy / Girl
• School: ____________________ Grade: ________

INSTRUCTIONS


Please mark the most appropriate response when answering the questions.
In my class... 


1. I set my own learning goals (I decide what I need to learn).
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

2. I set my own process goals (I list what I need to do to achieve my learning goals).
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

3. I identify strategies for achieving my goals.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

4. I revise my goals when necessary.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

5. I am motivated to learn.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

6. I explain what I need to do when I get an assignment.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

7. I list the strategies I’m using when I work on assignments.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

8. I check my progress towards achieving my goals.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

9. I modify (correct) my actions on my own to achieve my goals.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

10. I modify (correct) strategies that are not helping me achieve my goals.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

11. I give helpful advice to my classmate on their work.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

12. I use comments from my teacher to improve on my work.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

13. I use comments from my classmate to improve on my work.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

14. I use comments from my family to improve on my work.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

15. I revise versions of my work to improve them.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

16. I reflect on the strategies I used to achieve my goals.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

17. I evaluate my own work (I look at my work to see if it is good or needs improvement).
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

18. I know how I am being evaluated.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

19. I make connections between the amount of time I spend on my work, and my achievement.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

20. I work well with other students.
Strongly disagree   Disagree   Undecided   Agree   Strongly agree

SECTION 2: ePEARL USE


Please answer this section ONLY if you have used the ePearl software in your class.


I liked using ePearl in my class because...


I did not like using ePearl in my class because...


ePearl helped me learn how to...


I would like to use ePearl again next year because...


I do not want to use ePearl again next year because...


What I liked the most about using ePearl is...


What I liked the least about ePearl is...














Thank you again for your collaboration! 

Universal Design for Learning and Elementary School Science: Exploring
the Efficacy, Use, and Perceptions of a Web-Based Science Notebook 


Gabrielle Rappolt-Schlichtmann 
CAST, Inc., Wakefield, Massachusetts, and Harvard Graduate 
School of Education 


Samantha G. Daley, Seoin Lim, Scott Lapinski,
Kristin H. Robinson, and Mindy Johnson
CAST, Inc., Wakefield, Massachusetts 


Science notebooks can play a critical role in activity-based science learning, but the tasks of recording, 
organizing, analyzing, and interpreting data create barriers that impede science learning for many 
students. This study (a) assessed in a randomized controlled trial the potential for a web-based science 
notebook designed using the Universal Design for Learning (UDL) framework to overcome the chal- 
lenges inherent in traditional science notebooks, (b) explored how teacher characteristics and student use 
of supports in the digital environment were associated with productive inquiry science learning behav-
iors, and (c) investigated students’ and teachers’ perceptions of the key affordances and challenges of the
technology to their learning. Use of the UDL science notebook resulted in improved science content 
learning outcomes (γ = .34, p < .01), as compared with traditional paper-and-pencil science notebooks,
and positively impacted student performance to the same degree, regardless of reading and writing 
proficiency and motivation for science learning at pretest. Students of teachers with greater experience 
using science notebooks and students who more frequently used the contextual supports within the 
notebook demonstrated more positive outcomes. Students and teachers reported overall quite positive 
experiences with the notebook, emphasizing high levels of interest, feelings of competence, and 
autonomy. 


Keywords: Universal Design for Learning, science notebook, elementary education, technology, design-based research


Modern science education emphasizes learning that integrates 
higher order thinking skills with content-area knowledge in au- 
thentic problem-solving activities (Kame’enui & Carnine, 1998). 
Students are expected to learn actively through observation and 
interaction, rather than direct instruction (Chinn & Malhotra, 2002; 
Mastropieri, Scruggs, Boon, & Carter, 2001; National Research 
Council, 1996). Such active science learning requires students to 
develop and use a number of complex skills (Chinn & Malhotra, 
2002), including making claims, observing, collecting information 
and data, analyzing, drawing conclusions, and presenting findings. 
Students need to build explanations by connecting their observations
during inquiry science experiences to claims about what their
observations might mean (McNeill & Krajcik, 2006; National 
Research Council, 2000; Sandoval & Reiser, 2004). 

Within the mandate for active science learning, educators face a 
daunting set of challenges. The No Child Left Behind Act (2001) 
sharply increased accountability for raising the science achieve- 
ment of all students (Individuals with Disabilities Education Act, 
2004), and the student population is increasingly diverse. Although 
active science learning presents challenges for all students (De 
Jong & Van Joolingen, 1998; Keselman, 2003), the processes 
required may present particular difficulty for those who struggle 
with reading and writing (Englert, Raphael, Fear, & Anderson, 
1988; Graham, 1990; Graham, Harris, MacArthur & Schwartz, 
1991; Swanson, 1999; Wong, 2001), or who otherwise have low 
motivation for science learning (Keselman, 2003; Scruggs & Mas- 
tropieri, 1994). 

Students may struggle not only with understanding the sci- 
ence concepts, and, in some cases, not even with the scientific 
inquiry process, but also with aspects of the active learning 
experience that are unintentional barriers to the deep learning 
being pursued. Sophisticated learning tools provide an oppor- 
tunity to address these unintended, construct-irrelevant barriers. 
Such tools can assist teachers to more effectively and efficiently 
support students throughout the active science learning process 
and foster the skills and behaviors that are most productive in 
science learning. 

In this article, we first describe a Universal Design for Learning 
(UDL) web-based science notebook designed to support elementary
school students and their teachers during active science learning.
We then report the results of a study that (a) assesses the
potential impact of this web-based science notebook to support 
improved content knowledge outcomes as compared with tradi- 
tional paper-and-pencil science notebooks, (b) explores the factors 
that contribute to students’ effective active science learning be- 
haviors in the web-based notebook environment, and (c) investi- 
gates students’ and teachers’ perceptions of the key affordances 
and challenges of the technology. We provide this as a demon- 
stration of the potential for UDL technology that provides options 
to address the variability in knowledge, skills, and preferences 
within an elementary school classroom. Such UDL technology can 
overcome barriers and support learning beyond the capabilities of 
more static learning tools. 


Building a Better Science Notebook 


Science notebooks are widely used to support the active science 
learning process and the development of scientific literacy (Har- 
grove & Nesbit, 2003; Klentschy, 2005), offering students the 
opportunity to engage in authentic scientific thinking (Hargrove & 
Nesbit, 2003) and providing teachers with opportunities for em- 
bedded formative assessment (Hargrove & Nesbit, 2003; 
Klentschy, Garrison, & Amaral, 1999; Shepardson & Britsch, 
2004). When used effectively, science notebooks can support 
students to develop critical thinking and conceptual understanding 
(Keys, 2000; Miller & Calfee, 2004). 

However, the research literature indicates that science notebooks are
typically and primarily used in a mechanical way to record data, 
procedures, or definitions—and rarely to support development of 
deep understandings through the active science learning process 
(Baxter, Bass, & Glaser, 2001; Ruiz-Primo, Li, Ayala, & Shav- 
elson, 2004). Furthermore, science notebooks present multiple 
barriers to students who struggle in the learning process because 
they require relative proficiency in reading and writing to be 
useful. Without high enough skill levels in these domains, students 
are unable to use notebooks to support the development of deep 
understandings through activity-based learning. 

Designing a science notebook that incorporates the principles of 
UDL provides an opportunity to address both potential pitfalls in 
the use of science notebooks—UDL focuses on supporting the 
deep learning process and overcoming unnecessary barriers to 
such learning. UDL is a transdisciplinary framework that facili- 
tates interaction between researchers from the learning sciences 
and professionals within education, focused on problems of com- 
mon interest. The UDL framework can be used to reach a holistic 
understanding and work toward innovative solutions (Rappolt- 
Schlichtmann, Daley, & Rose, 2012; Rappolt-Schlichtmann & 
Watamura, 2010; Samuels, 2009). 

The basic premise of UDL is that barriers to learning occur in 
the interaction with curriculum—they are not inherent solely in the 
capacities of the learner. Just as universally designed buildings 
provide options that accommodate a broad spectrum of users, 
universally designed tools and curricula offer a range of options 
for accessing and engaging with learning materials. “Universal” 
does not mean “one size fits all”; rather, it implies that curricula 
and materials are conceived of and designed to accommodate the 
widest possible range of learner needs and preferences. 
Advances in technology have made the development of UDL
approaches, texts, content curricula, strategy-based interventions, 
and assessment possible (Dolan, Hall, Banerjee, Chun, & Strang-
man, 2005; Rose & Meyer, 2002, 2006). Developed under the
UDL framework, digital environments provide the necessary in- 
frastructure and flexibility to allow for the creation of accessible, 
highly effective apprenticeship environments where students are 
actively guided in the process of constructing meaning through the 
provision of just-in-time feedback and contextual supports that can 
be gradually withdrawn as student expertise increases (Cognition 
and Technology Group at Vanderbilt, 1993; Collins, Brown, & 
Newman, 1989; Palincsar, 1986, 1998; Palincsar & Brown, 1984).
Through this kind of design approach, teachers can be supported 
and provided with the flexible tools they need to create more 
effective and differentiated learning experiences for students (Dal- 
ton, Pisha, Eagleton, Coyne, & Deysher, 2002). 


The Universally Designed for Learning Science 
Notebook (UDSN) 


Like traditional science notebooks, the UDSN provides students 
with (a) space to collect, organize, and display observations and 
data; (b) space to reflect and make sense of inquiry experiences; 
and (c) multiple opportunities to demonstrate understanding and 
receive formative feedback. But with UDL as the design frame- 
work (CAST, 2011) and the potential of digital technology as the 
platform, the UDSN differs from traditional science notebooks in 
several key ways. 

First, the UDSN was designed with a purposeful focus on 
lowering construct-irrelevant barriers to science learning. It was 
thus developed according to accessibility guidelines from the 
World Wide Web Consortium (W3C-WAI, 1999), Section 508 of 
the Rehabilitation Act (29 U.S.C. 794d), and the National Center 
for Accessible Media (2006). Text-to-speech technology is built
directly into the notebook interface, as are word-by-word
English-to-Spanish translation and alt text and long descriptions for
images. All actions are keyboard accessible, and a multimedia
glossary offers just-in-time support for vocabulary
use and development (see Figure 1). These features overcome such
barriers for the many students whose literacy skills would interfere 
with the efficacy of materials that depend on proficiency in reading 
and writing (Hsu, 2004; Klecan-Aker & Caraway, 1997; Scruggs, 
Mastropieri, Bakken, & Brigham, 1993; Storch & Whitehurst, 
2002; B. Y. White & Frederiksen, 1998), as well as for those 
students who use accessibility features due to sensory or motoric 
limitations, those for whom proficiency in English is a barrier, and 
others who would more effectively learn through use of built-in 
accessibility features. 

Access to materials and tools, as is provided through the 
features described above, is an important advantage that can be 
built into digital technologies, but UDL offers another level of 
design advantage—access to learning. UDL places a premium 
on the use of contextual support (CAST, 2011) that, in this case, 
is intended to develop and then reinforce effective science 
learning behaviors. Pedagogy is built into the interface design 
itself, guiding students and teachers in the process of active 
science learning, and specifically the effective use of science 
notebooks. For example, the navigation of the UDSN (see 
Figure 1A) provides a conceptual anchor in both words and 
pictures. As students complete each part of a science activity, 
they are reminded through the navigational structure of the 
UDSN that they are moving through a process: plan, get data, 
explain. Once students begin to build an explanation for their 
inquiry experience, they are further provided with contextual 
supports to facilitate, guide, and then reinforce the process 
behaviors necessary for effective science notebook use. The 
“Show Me” feature (see Figure 1B) provides brief captioned 
videos, where students are guided in how to go about building 
an explanation—students are prompted to think about “How do 
I start?” “What’s next?” and “What does it mean?” Students can 
use the “Check My Work” support (see Figure 1C) to be sure
their explanations contain all of the necessary components. 
Among other things, they are prompted to think about making 
direct reference to their data and observations and to use 
relevant vocabulary from their inquiry experiences. Further- 
more, students can choose to express their thinking through any 
of a variety of multimedia response options, including typing, 
drawing, audio recording, or uploading a picture (see Figure 2). 

Beyond the pedagogy incorporated in the student-facing
interface, active science learning is facilitated through the role
of the teacher in using the UDSN. Teachers are supported in the
process of providing feedback to students (see Figure 3). Feed- 
back is a necessary catalyst for self-regulated, effortful, and 
persistent learning behavior, and formative assessment is
needed so that teachers can be successful at differentiating the 
affordances of the UDSN tool to student strengths and weak- 
nesses (Butler & Winne, 1995). In the UDSN teachers can view 
all of their students’ explanations at once or one at a time, and 
make quick notes to themselves about whether the student “got” 
the concept or not (see Figure 3A). Teachers are provided 
“What to look for” information (see Figure 3B), including core 
concepts, common misconceptions, and model feedback. 
“Teacher timesaver” (see Figure 3C) provides sentence starters 
for feedback that is process oriented and catalogues recent 
feedback the teacher has given to other students. Teachers are 
prompted and supported to provide feedback that may include 
corrective information, alternative strategies, information to 
clarify ideas, or encouragement to engage in the scientific 
process (Hattie & Timperley, 2007). 

Students can then use teacher-provided feedback to pursue 
active science learning and engage in productive science learning 
behaviors. In response to feedback, students are prompted to 
revisit observations (see, e.g., Figure 1D) to revise their explana- 
tions and then add a “Line of Learning” to their notebook (see 
Figure 1E). When students add a line of learning, they are indi- 
cating a shift in their thinking in response to feedback or new 
information gathered through inquiry experiences. In this way they 
revise their ideas but retain the string of explanations they have
built: the evolution of their thinking is evident and emphasized in
the interface of the UDSN.

Figure 2. Multimedia response options and supports for students, Universally Designed for Learning Science Notebook.


Method 


The UDSN was developed through a process of progressive 
refinement using design-based research methodology. Design- 
based research is a formative evaluation approach to intervention 
and technology development where the goal is to refine both 
existing theory from the research literature and to generate “usable 
knowledge” to improve educational practice (Collins, Joseph, & 
Bielaczyc, 2004; Flagg, 1990; Reeves & Hedberg, 2003). Within 
this framework, development and research take place through 
continuous cycles of design, implementation, analysis, and rede- 
sign (Cobb, 2001; Collins, 1992). The current study represents the 
summative research associated with this work (Rappolt- 
Schlichtmann, Daley, Lim, Robinson, & Johnson, 2011). 

The overarching goal of the research was to determine whether 
the UDSN enhanced the science learning of diverse fourth-grade
students in authentic public school settings relative to paper-and-
pencil science notebooks and to understand student and teacher 
experiences in using the web-based science notebook at the fourth- 
grade level. Furthermore, we sought to quantitatively and qualita- 
tively explore the mechanisms by which the UDSN operates on 
student science learning. The following research questions were 
posed: 

• Research Question (RQ) 1 (Overall Impact): On average, do
students in classrooms using support-rich, UDL science notebooks
learn and understand more about science than similar students in
similar classrooms using traditional paper-and-pencil science
notebooks?

• RQ 2 (Differential Impact): On average, is the impact of the
UDSN the same for students at various reading and motivation
levels?

• RQ 3 (Use): Do students use the UDSN in ways that would
indicate productive science notebook use?

• RQ 4 (Differential Use): Do students whose teachers have
more professional experience and students who more frequently
use the contextual process-focused supports in the UDSN tend to
engage in more productive science notebook learning behaviors?

• RQ 5 (Perceptions): What are students’ and teachers’ perceptions
of the usefulness of the UDSN in science learning?

Figure 3. Teacher interface, Universally Designed for Learning Science Notebook.


Site and Participants 


The data were collected from eight schools within a large 
southeastern school district in the United States. The district con- 
sisted of rural, suburban, and urban schools. The sample consisted 
of 621 fourth-grade students from 28 different classrooms and 22 
teachers. Overall, the student sample consisted of nearly 35% 
minority students, and 10% of the students were characterized as 
having an Individualized Education Program (IEP), or a 504 plan. 


Procedure 


To answer RQs 1-4, we conducted a randomized controlled trial
that investigated the potential of the notebooks to support the
development of student content knowledge, as compared with
traditional science notebooks. Over the course of the 8-10 weeks
of the study, fourth-grade students used either the UDSN
(treatment group) or traditional science notebooks (control
group) as they completed the magnetism and electricity unit of the
Full Option Science System (FOSS) curriculum. One of the most
widely used science curricula in the United States, FOSS is a
research-based, K-8 curriculum developed by the Lawrence Hall of Science
(Banilower, 2002). Students and teachers regularly use paper-and-
pencil science notebooks as a part of this curriculum.

Participating teachers were randomized to either the treatment
or the control group in two steps. First, pairs of teachers within 
each school were identified using a matching index of teacher 
experience and classroom demographics. Second, from the pairs, 
we randomly assigned one teacher to the control group and one to 
the treatment group. All teachers who participated in the research 
study, regardless of group assignment, received a 1-day FOSS 
training session focused on the use of science notebooks to support 
science learning at the elementary school level and a 1-hr high- 
level orientation to the UDL framework. Teachers in the treatment 
group also participated in an additional 2 hr of training, where they 
were made aware of the additional features the UDSN provided 
above and beyond paper-and-pencil notebooks, and had an oppor- 
tunity to practice using those features as if they were students. 
Teachers in all conditions received follow-up support 1 and 2 
weeks after beginning the implementation, and then technical
support for the remainder of the implementation.
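
The two-step teacher assignment described above (matching teachers into pairs within each school, then randomly assigning one member of each pair to each condition) can be illustrated with a minimal sketch. This is not the procedure the study actually ran: the field names `name`, `school`, and `index`, the function `assign_matched_pairs`, the rule of pairing neighbours on a single sorted matching index, and the sample data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def assign_matched_pairs(teachers, seed=None):
    """Pair teachers within each school, then randomize within pairs.

    `teachers` is a list of dicts with the hypothetical keys
    'name', 'school', and 'index' (a single matching score).
    """
    rng = random.Random(seed)

    # Group teachers by school so that pairs never cross schools.
    by_school = defaultdict(list)
    for teacher in teachers:
        by_school[teacher["school"]].append(teacher)

    assignment = {}
    pairs = []
    for group in by_school.values():
        # Step 1: sort on the matching index so that neighbours are
        # the most similar teachers, then pair adjacent entries.
        ordered = sorted(group, key=lambda t: t["index"])
        # A teacher left without a partner is skipped in this sketch.
        for i in range(0, len(ordered) - 1, 2):
            first, second = ordered[i], ordered[i + 1]
            # Step 2: a fair coin decides who receives the treatment.
            if rng.random() < 0.5:
                treated, control = first, second
            else:
                treated, control = second, first
            assignment[treated["name"]] = "treatment"
            assignment[control["name"]] = "control"
            pairs.append((first["name"], second["name"]))
    return assignment, pairs


# Hypothetical example data: two schools, six teachers.
teachers = [
    {"name": "T1", "school": "A", "index": 0.20},
    {"name": "T2", "school": "A", "index": 0.25},
    {"name": "T3", "school": "A", "index": 0.70},
    {"name": "T4", "school": "A", "index": 0.80},
    {"name": "T5", "school": "B", "index": 0.40},
    {"name": "T6", "school": "B", "index": 0.45},
]
assignment, pairs = assign_matched_pairs(teachers, seed=1)
```

Randomizing within matched pairs keeps the two groups balanced on the matching variables by construction, which is the usual motivation for this kind of design.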

Prior to beginning the intervention, students in the treatment 
condition were given training on the use of UDSN by one of two 
project team members. This included an overview of the UDSN 
features, as well as a .5-hr session during which students were able 
to try the UDSN and ask questions. The project team also spent 1
week within each school ensuring that the technology (computers and
Internet access) was sufficient for implementation of UDSN. In 
one instance when technology was a barrier, we provided the 
classroom with an extra router to allow the use of a laptop cart. 

To answer RQ 5, we conducted focus groups with students and
interviews with teachers who used the UDSN during the summa- 
tive research study described above. Eighty-four students with 
experience using the UDSN (six students from each of the 14 
experimental classrooms) were selected to participate in focus 
groups. The selection process was purposeful so as to include both 
high- and low-achieving students, students with and without IEPs, 
and an equal number of boys and girls. All 11 teachers who 
implemented the UDSN as a part of the summative study were 
interviewed. Focus groups with students and interviews with 
teachers were conducted within 2 weeks of the conclusion of the 
study period and lasted 60-75 min. Students and teachers were 
asked the same set of semistructured questions (e.g., for students, 
“Do you think your UDSN helped you learn science or not?; 
How?”; “Were there times when your UDSN made it more diffi- 
cult or got in the way of your science learning?”; “If you think the 
UDSN helped you learn science, describe something specific you 
did in your UDSN that helped you learn science”; “What about 
your UDSN was frustrating or difficult?”), but the format of these 
sessions was open-ended so as to allow for the spontaneous elab- 
oration of thoughts. Focus groups and interviews were taped and 
then transcribed for later coding. 


Measures 


Magnetism and electricity content knowledge (Assessing Sci- 
ence Knowledge [ASK]). The ASK Survey (Ferguson, Long, & 
Kennedy, 2009) was used to measure changes in knowledge of 
science content taught within the FOSS curriculum. Student pro- 
ficiencies were computed using a maximum likelihood estimation 
method using ConQuest software. Student proficiencies are com- 
puted as the location on a content-specific scale that is most 
probable for students given their responses to the items. Because 
student proficiency estimates are not provided to teachers directly, 
we did not scale the raw logit scores onto an external positive scale 
such as 0-100. Thus, student proficiency estimates ranged
between -6.0 and 6.0.
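
To make the idea of a "most probable location" concrete, the sketch below finds the maximum likelihood proficiency for one student under a dichotomous Rasch model with known item difficulties. The Rasch form, the item difficulties, and the clipping of extreme scores at the ends of the reported range are assumptions made for illustration; the source states only that maximum likelihood estimation in ConQuest was used.

```python
import math


def rasch_mle(responses, difficulties, lo=-6.0, hi=6.0, tol=1e-8):
    """Maximum likelihood proficiency (in logits) for one student.

    responses: list of 0/1 item scores; difficulties: item difficulties in logits.
    Under the Rasch model, P(correct) = 1 / (1 + exp(-(theta - b))), and the
    derivative of the log-likelihood is sum(x_i) - sum(P_i(theta)). That
    derivative decreases in theta, so bisection on [lo, hi] finds its root.
    """
    raw = sum(responses)
    if raw == 0:
        return lo  # all items wrong: the estimate is unbounded below, so clip
    if raw == len(responses):
        return hi  # all items right: the estimate is unbounded above, so clip

    def score(theta):
        expected = sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)
        return raw - expected

    # If the root lies outside the search interval, return the nearer bound.
    if score(lo) <= 0:
        return lo
    if score(hi) >= 0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A student who answers half of a set of equally difficult items correctly is placed exactly at the item difficulty, which is why raw logit estimates can be negative.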

Motivation for science (Motivation for Science [MFS] 
Inventory). Due to the lack of published assessments on student 
motivation in science, we developed the MFS inventory. After an 
extensive literature review, we derived questions representing four 
constructs key to student motivation at the elementary level: self- 
efficacy, interest, desire for challenge, and social behavior. The 
survey consisted of 23 Likert-type items (“I am good at science,” 
“I like science when it is hard and challenging”). Students rated
items on a scale ranging from 1 (very different from me) to 4 (a lot 
like me). Questions related to the “social” construct did not reliably
hang together and so were eliminated during the pilot test phase of
this project. Reliability for the pilot test sample of the MFS survey
was .85; for the experimental sample, it was .89. 
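
The source does not name the reliability coefficient. Assuming it is an internal-consistency index such as Cronbach's alpha, a common choice for Likert-type scales, the computation can be sketched as follows; the function name and the example ratings are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a respondents-by-items table of ratings.

    item_scores: list of respondents, each a list of item ratings (e.g., 1 to 4).
    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Two items that order four respondents similarly but not identically.
print(round(cronbach_alpha([[1, 2], [2, 1], [3, 4], [4, 3]]), 2))  # 0.75
```

Items that fail to "hang together" with the rest of a scale pull this coefficient down, which is one common reason for dropping a subscale after a pilot.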

Reading and writing proficiency (Measure of Academic 
Progress [MAP]). The district administered the MAP to assess 
students’ proficiency in reading and writing; scores were collected 
with permission from participants. The MAP assessments are 
computerized-adaptive tests developed by the Northwest Evalua- 
tion Association (NWEA) and used in 2,570 school systems across 
49 states nationwide to assess student proficiency in a variety of 
areas including English Language Arts. The MAP tests were 
developed from large pools of items that have been calibrated for 
their difficulty. Prior research indicates the MAP assessments are 
valid and reliable indicators of student proficiency in targeted 
areas (NWEA, 2005). 

Electronic usage log. The web-based UDSN includes an elec- 
tronic usage log. This log allows researchers to view items clicked, 
including use of navigation, supports, choices of whether to type, 
draw, or audio-record responses, and other patterns of use in time 
series within the program for both students and teachers. Using this 
log, we can determine the frequency of use and usage patterns, 
helping us to determine in what way components are being used, 
thus informing design, organization, and content. 

Teacher background. Teacher background characteristics 
were collected via questionnaire. This information included age, 
years of teaching experience, years of teaching at the current grade 
level, years teaching with the FOSS curriculum, years teaching 
using science notebooks, number of hours of science-focused 
professional development in the past 5 years, and hours spent 
teaching science each week. 


Results 


Quantitative Analysis 


To answer RQ 1-4, we used a multilevel modeling approach. 
Multilevel modeling can account for measurement and sampling 
error when variables operate at different but related levels of 
organization, resulting in correctly adjusted standard errors for the 
treatment effect (Raudenbush & Bryk, 2002; Singer, 1998). To 
answer the first two research questions, we fit a series of three- 
level models in which students were clustered within teachers, and 
teachers were clustered within schools. To answer RQ 3 and 4, we 
fit two-level models (students within schools) because these ques- 
tions deal only with the treatment group, making the classroom and 
school levels of analysis synonymous due to our sampling and 
randomization procedure. Continuous covariates were grand-mean 
centered, whereas categorical variables were represented as 0/1 
indicators. Goodness of fit was assessed using the -2 log-likelihood
and Akaike’s information criterion statistics, where smaller values
indicate better fit. Variables were systematically added to the model
and then maintained if significant. All models were fit using either 
the PROC MIXED procedure within SAS or xtmixed within Stata. 
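
The covariate coding and fit comparison described above amount to a few lines of arithmetic. The sketch below is illustrative only: the helper names and sample values are hypothetical, and the parameter count `k` passed to the AIC helper is an assumption about what the software counted.

```python
def grand_mean_center(values):
    """Subtract the overall sample mean so that 0 represents an average student."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def indicator(values, reference):
    """Code a two-category variable as 0 (reference category) or 1 (other category)."""
    return [0 if v == reference else 1 for v in values]


def aic(minus_two_ll, k):
    """Akaike's information criterion: -2 log-likelihood plus 2 per counted parameter."""
    return minus_two_ll + 2 * k


map_scores = [205, 217, 229]  # hypothetical pretest reading scores
print(grand_mean_center(map_scores))  # [-12.0, 0.0, 12.0]
print(indicator(["control", "treatment"], reference="control"))  # [0, 1]
```

With `k = 3`, `aic(1219, 3)` gives 1225 and `aic(831, 3)` gives 837, which matches the -2LL and AIC rows for the baseline and final models in Table 2. The constant gap of 6 would be consistent with counting only the three variance components, although the source does not say how the criterion was computed.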

RQ 1 (Overall Impact). On average, do students in class- 
rooms using support-rich UDL science notebooks learn and un- 
derstand more about science than similar students in similar class- 
rooms using traditional paper-and-pencil science notebooks? We 
list the sample means and standard deviations for all variables 
included in our analyses in Table 1. Note that at posttest, content 
knowledge was relatively higher among students in the treatment

Table 1
Mean Performance Scores and Covariates of Students in the Treatment and Control Groups

                              Treatment
Variable                      n      M (SD)
Outcome (posttest)
  M&E Knowledge               355    .42 (.9)
Predictors (pretest)
  M&E Knowledge               355    -1.4 (.7)
  MAP, Language Arts          355    217 (10.8)
  Motivation for Science      346    [illegible]


group (M = .42, SD = .9) than in the control group (M = .01,
SD = .9). Although the standard deviation of content knowledge at
posttest was the same for both groups, the range of values in the
treatment group (-1.8 to 3.7) was narrower than in the control group
(-4.6 to 2.8). Means and standard deviations for the treatment and
control groups on the MAP language arts test, motivation for science
survey, and magnetism and electricity content assessment were similar
at pretest, though values on the content knowledge pretest were
slightly higher in the treatment group (M = -1.4, SD = .7) than in the
control group (M = -1.7, SD = .9). We controlled for pretest content
knowledge in our model of impact.

In Table 2, we present the baseline control and final model, as 
well as a series of models describing interaction tests between 
covariates and treatment extracted from the larger taxonomy of 
models that we fit systematically during data analysis. We present 
estimates and goodness-of-fit statistics for both the fixed effects 
and the variance components, along with goodness-of-fit statistics 
for the overall model. RQ 1 concerns the overall impact of the 


Table 2 


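Table 1 (continued)

                              Treatment       Control
Variable                      Range           n      M (SD)        Range
Outcome (posttest)
  M&E Knowledge               -1.8 to 3.7     168    .01 (.9)      -4.6 to 2.8
Predictors (pretest)
  M&E Knowledge               -4.6 to .61     168    -1.7 (.9)     -4.6 to .61
  MAP, Language Arts          172 to 280      168    215 (12.4)    163 to 243
  Motivation for Science      16 to 52        168    40.6 (7.3)    19 to 52

Note. M&E knowledge = Magnetism and Electricity Knowledge; MAP = Measure of Academic Progress.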


UDSN. We find that students using the UDSN demonstrated
greater knowledge of the science content at posttest than peers
using traditional science notebooks (γ = .32, p < .05), controlling
for pretest levels of content knowledge (γ = .25, p < .001),
reading skills (γ = .04, p < .001), and motivation for science (γ =
.01, p < .05) (see Table 2, Final column; overall model fit at
χ²[3] = 388, as compared with baseline).

The fixed-effect parameter for treatment indicates that, on
average, students in treatment classrooms exhibit proficiency scores
.32 points higher than students in control classrooms. Forty-four
percent of the explainable variation in the ASK posttest is explained
by assignment to the treatment group in this model.
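
Two of the figures reported for RQ 1 can be checked by hand against Table 2. The sketch below assumes that the chi-square statistic is the drop in -2 log-likelihood from the baseline model (1,219) to the final model (831). It further assumes that the 44% figure is the proportional reduction in between-teacher variance when treatment is added to the covariates-only model (.09 versus .05). The second reading is an inference from the table values, not something the source states.

```python
def deviance_difference(minus_two_ll_reduced, minus_two_ll_full):
    """Drop in -2 log-likelihood when moving from a reduced to a fuller model."""
    return minus_two_ll_reduced - minus_two_ll_full


def proportion_variance_explained(var_without, var_with):
    """Proportional reduction in a variance component after adding a predictor."""
    return (var_without - var_with) / var_without


print(deviance_difference(1219, 831))  # 388
print(round(proportion_variance_explained(0.09, 0.05), 2))  # 0.44
```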

RQ 2 (Differential Impact). On average, is the impact of the 
UDSN the same for students at various reading and motivation 
levels? A pair of interactions was added individually to the overall 
impact model to determine whether differential effects of the 
UDSN were evident for students with varying levels of pretest 
reading skills and motivation for science. Neither the interaction 


Table 2
Fixed Effects Estimates (Top) and Variance-Covariance Estimates (Bottom) for Models Describing Predictors of Knowledge of
Magnetism and Electricity at Posttest, Including the Impact of Condition

Parameter                  Baseline          Final             Covariates        Interaction 1     Interaction 2
Fixed effects-Level 1
  Intercept                [illegible]       [illegible]       [illegible]       .43 (.16)*        .49 (.16)*
  ASK pretest              .49 (.05)***      .25 (.05)***      [illegible]       [illegible]       [illegible]
  MAP                                        .04 (.003)***     .04 (.003)***     .04 (.006)***     .04 (.003)***
  Motivation                                 .01 (.004)*       .01 (.004)*       .01 (.004)*       .01 (.008)
Fixed effects-Level 2
  Treatment                                  .32 (illegible)*                    [illegible]       [illegible]
  Treatment × MAP                                                                .004 (.01)
  Treatment × Motivation                                                                           -.002 (.01)
Random effects
  Between schools          .03 (.06)         .05 (.06)         .01 (.06)         .05 (.06)         .05 (.06)
  Between teachers         .16 (.07)*        .05 (.03)*        .09 (.06)*        .05 (.04)†        .04 (.03)*
  Within teachers          [illegible]       .39 (.03)***      [illegible]       [illegible]       [illegible]
Goodness of fit
  -2LL                     1,219             831               834               839               833
  AIC                      1,225             837               840               845               839

Note. Multilevel modeling (students within teachers within schools) was used to estimate the effects of treatment (Raudenbush & Bryk, 2002). All
models were fit using the PROC MIXED procedure within SAS. Standard errors are in parentheses. ASK pretest = Full Option Science Systems Assessing
Science Knowledge Pretest for Magnetism and Electricity; MAP = Measures of Academic Progress for Language Arts; Treatment = teacher random
assignment to Universally Designed for Learning Science Notebook (1) or Traditional Science Notebook (0); Motivation = Motivation for Science Survey;
-2LL = -2 log-likelihood; AIC = Akaike’s information criterion.
†p < .10. *p < .05. **p < .01. ***p < .001.

between treatment and pretest reading skills on the MAP assessment
(γ = .004, p > .05) nor the interaction between treatment and
motivation for science (γ = -.002, p > .05) was statistically
significant or improved the overall fit of the model (see Table 2,
Interactions 1 and 2, respectively). Thus, the relationship between
condition and posttest score did not differ by reading level, nor did 
the relationship between condition and posttest score differ by 
students’ levels of motivation for science. 

RQ 3 (Use). Do students use the UDSN in ways that would 
indicate productive science notebook use? To answer RQ 3, we 
conducted an exploratory analysis of student use of the UDSN 
(treatment condition only). We constructed metrics that describe 
student behaviors derived from click-by-click usage signals col- 
lected via the electronic usage log. Given the UDL framework and 
the goals of our design process, and after looking to the research 
literature, we were able to identify four behaviors that could be 
extrapolated from student clicks catalogued in time series: (a) 
number of sessions using the UDSN (overall use), (b) number of 
completed entries in which students were asked to explain key 
science concepts (reflective consolidation and demonstration of 
knowledge), (c) reviewing data/observations when asked to con- 
struct explanations of inquiry experiences (use of data or infer- 
ences from observations), and (d) revision of previous entries to 
reflect new understanding or teacher feedback (continuous learn- 
ing and recursion on ideas). 
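
As a rough illustration of how such metrics might be derived from a time-ordered click log, the sketch below tallies the four behaviors for one student. The event names (`login`, `view_data`, `save_explanation`, `edit_entry`) and the counting rules are hypothetical; the source does not describe the log schema or how clicks were mapped to behaviors.

```python
def summarize_usage(events):
    """Tally four notebook behaviors from one student's time-ordered click events.

    `events` is a list of event-type strings using hypothetical names:
    'login', 'view_data', 'save_explanation', 'edit_entry'.
    Returns (totals, per_session_averages).
    """
    totals = {"sessions": 0, "explanations": 0, "data_reviews": 0, "revisions": 0}
    pending_views = 0  # data-page views not yet followed by a saved explanation
    for event in events:
        if event == "login":
            totals["sessions"] += 1
            pending_views = 0  # views do not carry over between sessions
        elif event == "view_data":
            pending_views += 1
        elif event == "save_explanation":
            totals["explanations"] += 1
            # Count only the data views that preceded this explanation.
            totals["data_reviews"] += pending_views
            pending_views = 0
        elif event == "edit_entry":
            totals["revisions"] += 1

    sessions = totals["sessions"]
    per_session = (
        {k: v / sessions for k, v in totals.items() if k != "sessions"}
        if sessions
        else {}
    )
    return totals, per_session
```

Dividing each total by the number of sessions gives the kind of per-session average that lets heavy and light users be compared on the same footing.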

In Table 3, we provide descriptive statistics for these four 
behaviors, totaled across the entire implementation period, aver- 
aged by number of sessions each student used the UDSN, and, 
where appropriate, averaged across the number of key concept 
explanations completed. This set of descriptive statistics provides
both an overall picture of use across the 8- to 10-week implementation
period and an idea of how often each discrete behavior was
evident for each student. 

As Table 3 indicates, students used the UDSN an average of 10 
times across the implementation period, or approximately once per 
week. This varied, however, with some using it only a handful of 
times, and others using it 2 or 3 times the average amount. 
Students answered the key explanation questions an average of 12 
times, or, again, just over once per week. But, the range in number 
of responses approximated the range in total number of UDSN 
sessions across the implementation period, and the number of key 
concept entries averaged to once per session. 

Revisions to previous notebook entries and reviewing data in 
order to answer the key concept questions were somewhat more 


Table 3

discrete process behaviors that reflected not only what was completed
but also how students went about completing their work in
the UDSN. On average, students revised about four notebook 
entries over the course of the entire implementation period, but the 
variation in this activity was substantial. Although almost 20% of 
students never revised a post, almost 10% of students revised more 
than 10 posts. Students overall averaged a revision to a post in 
about one out of three UDSN sessions, and this was almost 
equivalent to the average per key concept question; most revisions 
to previous posts were revisions to responses to the key concept 
questions. Students averaged 36 instances of looking at a data- 
focused page in the UDSN before completing an answer to a key 
concept question, with a dramatically wide range from no in- 
stances to 149. Ten percent of students logged 69 or more in- 
stances of this key science process behavior, and the average was 
three per session and per key concept entry. 

RQ 4 (Differential Use). Do students whose teachers have 
more professional experience and students who more frequently 
use the contextual process-focused supports in the UDSN tend to 
engage in more productive science notebook learning behaviors? 
The variability in students’ science notebook learning behavior 
described above makes clear the need to understand what charac- 
teristics potentially promote such productive behaviors. We con- 
sidered two sets of potential characteristics: the experience level of 
the student’s teacher and the frequency with which the student 
engaged with the relevant supports in the UDSN web-based envi- 
ronment designed to promote the emergence of such behaviors. 
The four productive behaviors described above (number of UDSN 
sessions, number of key concept questions answered, number of 
edited posts, and number of reviews of data when answering key 
concept questions) were highly correlated with each other (r 
ranged from .56 to .74; all ps < .001); the total of these behaviors 
across the implementation period was combined in a single com- 
posite using a principal components analysis for the purpose of 
data reduction. All four behaviors loaded on a single factor. 
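
The data-reduction step can be illustrated with a small, dependency-free sketch that standardizes each behavior, forms the correlation matrix, and scores students on its first principal component. This is an assumption-laden illustration: the source does not say whether the analysis used the correlation or covariance matrix, and the function name and power-iteration approach are choices made here for brevity.

```python
import math


def first_component_scores(rows, iterations=500):
    """Score each row on the first principal component of the correlation matrix.

    rows: list of equal-length numeric lists (one row per student, one column
    per behavior). Returns (weights, scores, share_of_variance).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    corr = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(p))
                     for a in range(p))
    scores = [sum(zr[j] * v[j] for j in range(p)) for zr in z]
    return v, scores, eigenvalue / p
```

When the behaviors are strongly and positively intercorrelated, all weights share the same sign and the first component captures most of the total variance, which is what justifies replacing four counts with one composite.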

Using multilevel models in which students are clustered within 
teachers, we fit a set of models examining each type of character- 
istic predicting the composite outcome of productive behaviors. 
We present the results of the taxonomy of models in Table 4. The 
three teacher experience variables of interest were years of expe- 
rience using notebooks in science instruction, years of overall 
teaching experience, and years of experience with the FOSS cur- 
riculum. The goodness of fit of each model was compared with a 
baseline control model, including pretest content knowledge. 


Table 3
Mean Frequency of Students’ Productive Notebooking Behaviors Across the Intervention Period, Averaged Across Number of UDSN
Sessions, and Averaged Across Number of Explanations of Key Concepts Entered

                                                          Total across                    Average per            Average per key concept
                                                          intervention period             UDSN session           explanation entered
Variable                                                  n     M (SD)         Range      M (SD)        Range    M (SD)        Range
UDSN sessions                                             411   10.48 (4.3)    1-30       --            --       --            --
Key concept explanations entered                          411   11.52 (6.2)    0-33       1.07 (.43)    0-2      --            --
Revisions to previous notebook entries                    411   3.86 (4.3)     0-33       .32 (.26)     0-2      .32 (.29)     0-3
Reviewing data when entering key concept explanations     411   35.78 (24.8)   0-149      [illegible]   0-19     3.20 (1.9)    0-15

Note. UDSN = Universally Designed for Learning Science Notebook. Dashes indicate that data are not applicable.

Table 4
Fixed Effects Estimates (Top) and Variance-Covariance Estimates (Bottom) for Models Describing Predictors of Process Behaviors
Using the UDSN

Parameter                       Model 1             Model 2             Model 3             Model 4             Model 5             Model 6
Fixed effects-Level 1
  Intercept                     32.91 (3.67)***     25.47 (4.83)***     [illegible]         28.87 (5.87)***     29.58 (3.85)***     22.50 (4.49)***
  ASK pretest                   3.06 (1.04)**       3.03 (1.04)**       [illegible]         3.08 (1.04)**       2.67 (1.05)*        2.654 (1.04)*
  Use of process supports                                                                                       0.27 (3.85)**       0.262 (.10)**
Fixed effects-Level 2
  Years of notebook use                             1.307 (illegible)*                                                              1.27 (illegible)*
  Years of teaching experience                                          0.57 (.49)
  Years using FOSS curriculum                                                               0.54 (.61)
Random effects
  Between teachers              [illegible]         [illegible]         9.84 (2.70)***      10.22 (2.76)***     [illegible]         7.99 (2.22)***
  Within teachers               13.05 (.50)***      13.05 (.50)***      13.05 (.50)***      13.05 (.50)***      12.94 (.50)***      12.94 (.50)***
Goodness of fit
  -2LL                          2,828               2,823               2,827               2,827               2,825               2,819
  AIC                           2,837               2,833               2,837               2,837               [illegible]         2,831

Note. UDSN = Universally Designed for Learning Science Notebook; ASK pretest = Full Option Science Systems (FOSS) Assessing Science
Knowledge Pretest for Magnetism and Electricity; -2LL = -2 log-likelihood; AIC = Akaike’s information criterion.
*p < .05. **p < .01. ***p < .001.


The parameters for Model 2 indicate a statistically significant,
positive effect of teachers having had more experience using
science notebooks in instruction (γ = 1.307, p < .05); controlling
for pretest content knowledge, students of teachers with one more
year using science notebooks, on average, exhibit 1.3 more
productive behaviors. Neither years of overall teaching experience
(Model 3) nor years of experience with the FOSS curriculum
(Model 4) was a statistically significant predictor of student
productive behaviors.

As was the case for the productive behaviors used as the
outcome in these analyses, use of the supports intended to
encourage these behaviors in the UDSN was also highly correlated
across support types. Students who used the “check my work” coach
also tended to access agents to get guidance for understanding
or answering questions and tended to open videos designed to
provide guidance on scientific thought processes. Correlations (r)
among the three ranged from .33 to .53, all with p < .001. As with the
outcome behaviors, we combined total use of these supports into a
single composite using principal components analysis for the
purpose of data reduction; all three loaded on a single factor.

As shown in Model 5 of Table 4, the composite of use of
process-focused supports in the UDSN was a positive and
statistically significant predictor (γ = .27, p < .01) of productive
science notebook behaviors; controlling for pretest content
knowledge, students who used the contextual supports more frequently
were more likely to engage in the desired outcome behaviors. In
Model 6, we provide a final model combining the two substantive
predictors of the teacher’s years of experience using science
notebooks in instruction (γ = 1.27, p < .05) and students’ use of
contextual, process-focused supports in the UDSN (γ = .26, p <
.01); both remain positive and statistically significant, controlling
for student pretest content knowledge (γ = 2.65, p < .05).


Qualitative Analysis 


To answer RQ 5 (what are students’ and teachers’ perceptions of
the usefulness of the UDSN to science learning?), a grounded
approach to qualitative data analysis was used. Because the UDSN
is the first web-based program of this kind, we felt it was important 
to stay as close to the data as possible, allowing for the emergence 
of unexpected themes and serendipitous findings. Student focus 
groups and teacher interviews were coded and analyzed by the first 
author and two research assistants using the constant comparative 
method first developed by Glaser in 1965 (Glaser & Strauss, 
1967). An inductive approach was adopted whereby categories and 
thematic connections were identified within germane units of data 
through a reductionist process (Creswell, 2007; LeCompte &
Preissle, 1992; Miles & Huberman, 1984).

An electronic database was generated using the qualitative anal- 
ysis software NUD*IST nVivo (2002). Conceptually salient com- 
ments were marked with a series of codes and then extracted from 
the text. Codes were grouped into concepts, and thematic categories
were then formed. To validate the formation of concepts and thematic
categories, transcripts were scoured for negative cases and discon- 
firming evidence. Once all relevant data were grouped into the- 
matic categories (saturation), hypotheses describing student and 
teacher perceptions as to the usefulness of the UDSN in science 
learning were formed. Five thematic categories emerged from the 
data analysis. Four of these categories dealt expressly with en- 
gagement as students were or were not motivated to engage in the 
processing of their inquiry science experiences, whereas the fifth 
dealt with practical challenges associated with the use of the 
UDSN in comparison to paper-and-pencil science notebooks. 

Interest, excitement. Without exception, student focus 
groups and teacher interviews noted high excitement and/or inter- 
est among students in using the UDSN, with students also report- 
ing that the UDSN was more “fun” than paper-and-pencil science 
notebooks. When students offered an explanation as to why the 
UDSN was more “fun,” they most often noted that the UDSN 
reflected a personal interest in technology. A few students noted 
that technology, although ubiquitous in their personal lives, was 
not abundant at school.

... because the [UDSN] is not just paper pages. Lots of kids love
electronics, like I do. At home it’s like cell phones and computers and 
TV. It’s more fun for me to get on the computer and work instead of 
writing with pencil. [Student Comment] 


I think they [students] were way more engaged. They were more 
energized. It was exciting to them. And I really was kind of thinking 
that it might wear off a little bit as they got into it, but I don’t think 
it did at all. [Teacher Comment] 


As illustrated in the quote above, teachers typically noted that the 
excitement level did not wane over the 8- to 10-week implemen- 
tation period despite significant challenges related to hardware and 
broadband availability. 

Doing science, going deeper. Within the structured compo- 
nent of the interviews and focus groups, teachers and students were 
prompted to think about and then explain how the UDSN did or 
did not help science learning. Again, most students began with the 
idea that the UDSN was more “fun” than paper-and-pencil science 
notebooks but then expanded to articulate that because the UDSN 
was more “fun,” they spent more time “doing” science. 


Almost everybody likes things being exciting. The [UDSN] was 
exciting and fun ... when it’s fun, you want to do it longer and with 
[UDSN], you just really don’t want to get off. You want to keep doing 
it. [Student Comment] 


A student in one focus group reported that getting time on the 
computer to work on science using the UDSN became competitive 
(see below). Most focus groups reported that they found ways to 
use the UDSN outside of class time even though their teacher did 
not require it. 


I think we [students] did a lot more text and detailed pictures because 
we want to stay on the [UDSN] longer. And if class was over, we save 
what we have and go and do the [UDSN] by ourselves, which was 
really fun. And if people said, hey, it’s my turn, you’ve been on for 
like a whole class, it’s like sorry, I’m doing my stuff. [Student 
Comment] 


Taking ownership, showing science thinking. When asked 
what they spent more time doing using the UDSN, students most 
often reported that they worked on building their explanations or 
revising work as prompted by their teacher. Students noted that in 
comparison to paper science notebooks, they were more likely to 
attempt an explanation when using the UDSN: “Before we knew 
about [UDSN], we just did experiments ... we normally didn’t 
explain much” [Student Comment]. 

These comments were confirmed in teacher interviews where 
eight of the 11 teachers interviewed noted that with the UDSN, 
they were able to see student thinking about science more clearly, 
and more often than when they were using paper notebooks. One 
teacher commented that this was the first time she had been able to 
see her students’ original thinking about their science experiences 
in her class: “When they were working on the [UDSN], it was 
pretty much their [students] own original work and, so I was able 
to see not my thinking, but their thinking for the first time” 
[Teacher Comment]. Students expressed similar thoughts in four 
of the 14 focus groups. “It’s not just the teacher telling you the 
answers, it’s actually you getting into your work and actually 
doing it” [Student Comment]. Most teachers noted that their stu- 
dents seemed to have a greater sense of ownership over their work 
as represented within the UDSN.

It gave them more of the sense of ownership of their own [science
note-] book so they were more willing to keep up with it and keep
working it. Many of them asked me, ‘I didn’t get finished with that
lab, can I log onto UDSN and do that?’ And of course when they are
doing their paper notebooks, you know, usually I have to say to
students, ‘You have to finish this!’ and it’s not them asking me ‘Can
I go in there and do this?’ [Teacher Comment]


As illustrated in this quote, most teachers interviewed had hypoth- 
eses as to why student effort and excitement remained high over 
the course of the implementation, as well as why student thinking 
about science was more accessible with the UDSN. In the quote 
reported above, the teacher articulates her idea that students’ sense
of ownership reciprocally contributed to and reinforced high levels 
of interest, excitement, and effort. 

Feeling competent, showing what you know. Without ex- 
ception, student focus groups indicated that they liked having 
access to the UDSN contextual supports (e.g., Figure 1B and 
Figure 1C; Figure 2 sentence starter supports) and that the pres- 
ence and use of these supports made them feel more confident and 
competent in their work. For example, several students commented 
that sometimes they have trouble understanding what their teacher 
is asking them to do and expressed satisfaction with having access 
to guidance on inquiry activities through the UDSN. “My teacher 
sometimes tells us to do something, and I don’t really understand, 
so on the [UDSN] they have videos that tell us how to do it
[inquiry activity] and that was great” [Student Comment]. Another
student noted, “I liked [UDSN] because if you ever got
stuck on something and you forgot what to do, you could just click 
on the help thing on the side and it would like appear and help 
you” [Student Comment]. One student described the UDSN as a 
resource to be leveraged in the process of doing his science work: 
“The best thing about the [UDSN] is that it’s like a backup
resource” [Student Comment]. 

Although all of the student focus groups commented on the 
utility of various contextual support features of the UDSN, the 
reported usefulness of specific supports across students was highly 
varied. Some students placed high value on mechanical supports 
like spell check (see Figure 2), whereas others focused on concep- 
tual or organizational supports. All of the focus groups commented 
positively on the sentence starter supports (see Figure 2)
offered to help students begin their explanations.


I thought if you really didn’t know how to start your explanation, I 
thought it kind of helped you get an idea of how to make the words. 
The sentence starter, it gives me ideas and I just ... I just say, ‘Hey!
This could be easy!’ whenever I think it’s hard. [Student Comment]


Likewise, all of the teachers interviewed reflected on the utility of 
the contextual supports in facilitating student independence in their 
inquiry science work and science learning. One teacher hypothe- 
sized that the presence of the contextual supports was anxiety 
reducing for her students: “I think having all the resources in the
[UDSN] took some of the anxiety out of it for them [students]. Just 
knowing that they [students] had help there even if they didn’t use 
it, they knew they could” [Teacher Comment]. Half the teachers 
interviewed reported that the UDSN contextual supports and
organizing structure were helpful to their practice:


I think it also made me more accountable in some ways. Just because 
I used it on my SMART Board with the [UDSN] so much that it kept
me more structured. It kept me focused, I was more in control of the
lesson. [Teacher Comment] 


All of the teachers interviewed and each of the student focus 
groups noted the challenges of the paper-and-pencil format for 
students in communicating what they know at the fourth-grade
level:


[I]n a paper-and-pencil notebook, you know, there are a lot of kids 
who are challenged with handwriting and, honestly, spelling is atro- 
cious, so when they go back to re-read something or use their 
notebook to study, they don’t even know what they wrote. [Teacher 
Comment] 


Nine of the 11 teachers interviewed indicated that the response 
options (see Figure 2) were essential to the utility and effectiveness 
of the UDSN. 


The best thing about the UDSN is that there are so many ways kids 
can do things—you can upload pictures, you can draw, can record, 
you can type and translate. That’s the multiple options of express- 
ing their learning. It helped me realize that some students who 
were lower were getting it more than I thought, and then I could 
focus my teaching more productively. [Teacher Comment] 


One teacher noted that the multiple response option feature of the 
UDSN was especially important for the students in her class with 
IEPs: 


for me to hear their original thinking and for the first time see that 
understanding of the science concepts—I think that was a good 
moment. Especially the students with IEPs, it was the first time I 
could see what they knew about science. [Teacher Comment] 


Practical challenges. Without exception, teachers and stu- 
dents commented on the frustration they felt in the practical 
challenges they faced in using the UDSN. These challenges
were unique to the particular issues within each school. All but 
two schools in the study had hardware-related challenges, with 
only four or five computers in each classroom and a computer 
lab that was sometimes difficult to schedule, “I only had four 
computers for them to use, and I have 30 children, so that was 
a barrier” [Teacher Comment]. 


I had five computers here in my classroom. They used those and then 
I sent some into the computer lab. I would ask the computer teacher 
how many computers are free in the lab, and then I would send that 
number of students down there. And then in the media center, I signed 
up for blocks of time where we would have 10 computers. So my kids 
were kind of farmed out all over the building at times! [Teacher 
Comment] 


Most schools lacked access to adequate broadband given the 
number of students who were trying to access the Internet at the 
same time (including for classes not using the UDSN) and the 
amount of data that needed to be transferred. For students and 
teachers, the result was a sense that the UDSN was not working 
fast enough, “one bad thing I had myself, was it took too long 
to load, that was really frustrating” [Student Comment]. Most 
teachers noted that they had access to some technology but that 
the technology that they had access to lagged behind the kinds 
of things they wanted to use the technology for in their teaching,


at least at this school, we’re not technology starved, but we still don’t
have access like we really should. I have ideas about things to do and 
would like to do with technology. There are lots of interesting pro- 
grams out there that could be helpful to me, but most of the time I 
don’t bother because it’s so hard to deal with the technology in my 
school [Teacher Comment]. 


Discussion 


Previous research on the use and effectiveness of science note- 
books demonstrates the substantial challenge of supporting strug- 
gling students to effectively engage in desired science learning 
behaviors (Englert et al., 1988; Graham, 1990; Graham et al., 
1991; Swanson, 1999; Wong, 2001). Yet, the results of this inves- 
tigation indicate that the UDSN was successful in fostering im- 
proved outcomes in science knowledge, as compared with tradi- 
tional paper-and-pencil science notebooks. The UDSN “raised all 
boats,” including for those students who exhibited low reading and 
writing proficiency, and low motivation for science at pretest. 
These findings are especially remarkable given that students used 
the UDSN only an average of one time each week, and only an 
average of 10 times total over the course of the implementation. 
What aspects of the UDSN contributed to these striking outcomes? 
The UDSN was designed both to overcome accessibility-related 
and construct-irrelevant barriers to learning and to provide con- 
textual supports that promote the deep science learning intended 
through the use of science notebooks. How did these intentions 
play out in use of the UDSN? 

Although we are not able to disentangle, for each individual
student, which specific aspects of the UDSN’s design were most
critical, this kind of analysis would not necessarily provide mean- 
ingful insight according to the learning design used. Although 
certain features are necessary for accessibility for certain popula- 
tions (e.g., alt text on images for students with low vision), the 
emphasis in design was not on providing particular features for 
particular audiences, but rather on including options and contex- 
tual supports that are likely to improve access to learning for 
all—this is the UDL approach. A given student might in one 
notebook session benefit most from using text-to-speech to under- 
stand the procedure in an inquiry-based activity, and the same 
student might later have little difficulty with a reading segment but 
use an animated coach to think about how to craft an explanation. 
As suggested by the interviews and the finding of equal benefits 
across students of varying levels of ability and motivation, this 
flexibility seemed to have an overall positive impact. We turn now 
to aspects of that gestalt that emerge as likely contributors to 
effectiveness of the UDSN. 


User-Experience Design and UDL 


Technology in and of itself is not a means to successful learning 
outcomes, but rather a more flexible platform on which content 
and learning experiences can be rendered. The UDSN leveraged 
this flexibility to enact the theory of change and design of the 
program using the principles of UDL. The purpose of UDL is to 
facilitate the creation and study of learning environments that are 
usable by and effective for as many learners as possible. It is an 
approach that attempts to leverage the learning sciences in the 
user-experience design of educational environments, a kind of
continuously improving framework for active translation between
research and practice (Rappolt-Schlichtmann et al., 2012). 

The design of education experiences when UDL is leveraged 
should expressly focus on creating “desirable difficulties” for 
students, while reducing construct-irrelevant barriers to learning 
(Bjork, 1994). In this way, designers focus on rendering environ- 
ments that create challenges for students that are most central to 
the targeted learning goal or process, while simultaneously reduc- 
ing the unintended effects of obstacles that are tangential to key 
learning goals. The resulting experience is one in which students 
feel competent and confident to share and show what they know. 

Reducing construct-irrelevant barriers. This kind of user- 
experience outcome was evident in the UDSN implementation in 
which teachers reported seeing students’ original thinking for the 
first time, and students reported feeling a renewed sense of own- 
ership over and competence with their science work. For example, 
providing multiple means of expression (a UDL design principle) 
through the UDSN platform allowed students with disabilities and 
those with handwriting or expressive difficulty alike to demon- 
strate their science knowledge to their teacher through means that 
were most effective for them. Teachers then had a more productive 
platform to engage in a recursive feedback and revision process 
with students, targeted to students’ specific level of science un- 
derstanding. 

With paper-and-pencil science notebooks, teachers’ knowledge 
of students’ understanding of science is more likely to be obscured 
because student explanations are at once a representation of their 
competence at written expression and their thinking about science. 
Although handwriting may be an important learning goal in some 
instances, there is no need for it to be a barrier to some students in 
reaping the benefits of a science notebook; in this case, building 
science content knowledge is the primary learning goal. Impor- 
tantly, the writing process as it supports science learning through 
the use of the science notebook is maintained, but those students 
with low literacy levels can alternatively choose to record data 
and/or compose explanations by audio recording or drawing. In addition,
contextual supports are provided to remediate other barriers like 
spelling and handwriting. This tight focus on reducing construct- 
irrelevant barriers, while amplifying the central goals of the cur- 
riculum through the provision of contextual supports in a flexible, 
digital learning environment, allowed all students to have access to 
learning and express their thinking about science in ways that were 
accessible to their teachers. 

Creating desirable difficulty. With construct-irrelevant
barriers reduced or eliminated, contextual supports can be leveraged to
create “desirable difficulty.” The learning design is purposefully 
shaped to allow students to calibrate their own levels of challenge, 
without diluting the science concepts and productive scientific 
behaviors. Research from various perspectives emphasizes the key 
role of balance between the level of challenge in the environment 
and one’s perceived skills and resources as the driving force in 
shaping affective responses and cognitive engagement (Blascovich,
Mendes, Tomaka, Salomon, & Seery, 2003; Csikszentmihalyi,
1991; Daley & Rappolt-Schlichtmann, 2009; Lazarus & Folkman, 
1984). For example, Blascovich and colleagues describe “chal- 
lenge” motivational states when an individual perceives his or her 
resources as in balance with the demands of a task (Blascovich et 
al., 2003). Challenge states promote cognitive flexibility and
decision making and are characterized by energized, active
psychophysiological states. In a related framework, Lazarus and Folkman
(1984) provide a model of appraisal and adaptation in which 
positive emotions emerge from “challenging” experiences charac- 
terized by closely leveled demands and resources; such experi- 
ences lead to the mobilization of energy and promote the effort to 
respond. 

UDL leverages these concepts to provide a framework by which 
designers within education can systematically consider the provi- 
sion of contextual supports to create balanced appraisals of learn- 
ing challenges by diverse students. Such balanced appraisals create 
the conditions necessary for deep engagement to occur. Engage- 
ment in learning is at once emotional and cognitive, and is 
achieved through the application of appropriate challenge
calibrated to individual learners’ specific strengths and weak- 
nesses. A student can choose to access or ignore a given support, 
to use any of the various means of responding to a prompt, and to 
watch or pass by a video that provides additional information. 
Designers anticipate and reduce or eliminate barriers to deep 
engagement by providing options and supports that render the 
learning environment flexible and maintain a focus on specific 
learning goals (Meyer & Rose, 1998; Rose & Meyer, 2002). 
Technology enhances the degree of flexibility. 

The importance of contextual supports to the productive use of 
the UDSN and the creation of engagement in science learning was 
confirmed in both the quantitative and qualitative analyses. Teach- 
ers commented on the utility of the contextual supports for stu- 
dents in their science learning, and even hypothesized as to the 
emotional benefits in reducing anxiety and promoting more inde- 
pendent, confident work. When prompted to think about and share 
how the UDSN did or did not help their science learning, students 
often expressed feelings of agency and confidence knowing that 
the UDSN offered resources to help them if needed. Some indi- 
cated that when using supports, building explanations seemed 
more “doable,” where success was possible even though the task 
felt difficult. 

In the qualitative research, competence (Harter, 1978; R. W. 
White, 1963) and autonomy (deCharms, 1968; Deci, 1975) sur- 
faced as key themes among students and teachers in reporting their 
perceptions of the usefulness of the UDSN to science learning. 
Although understandable, this finding was not expected and sug- 
gests an avenue for future work. Students and teachers reported 
that in using the UDSN, students felt more ownership, agency, and 
control over the work as they were supported to build skills and 
feel competent in executing the inquiry process and producing
explanations describing their science experiences. It may be
that the overwhelmingly high and persistent levels of interest and 
excitement reported among students using the UDSN were, at least 
in part, attributable to the generation of feelings of competence and 
autonomy in their work. Building from self-determination theory 
(SDT), where educational contexts are seen to catalyze within- and
between-person differences in motivation (Deci & Ryan, 1985,
1991; R. M. Ryan, 1995), a stronger focus on creating a sense of 
relatedness (Baumeister & Leary, 1995; Reis, 1994) through the 
design of the UDSN may have further optimized students’ performance,
engagement, and feelings of well-being. It would be useful
to do a deeper analysis relating the engagement principle from the 
UDL framework with the research literature on SDT, especially
with regard to those research practice models that experimentally
describe conditions that foster versus undermine human potential
in learning. 


Designing for Variability in Engagement and Learning 


It is important to note that although most students thought that 
contextual supports were useful to their science learning, students 
were highly variable in whether they typically used supports and 
found them helpful. This was not unexpected. Taking a cue from 
the learning sciences, UDL assumes that variability is the rule and 
not the exception in learning and that learning is actively organized 
and context specific (Fischer & Bidell, 2006; Plomin & Kovas, 
2005; Rappolt-Schlichtmann, Tenenbaum, Koepke, & Fischer, 
2007; Thelen & Smith, 1994; Van Geert, 1998). Learners differ 
markedly in the ways in which they can be engaged or motivated 
to learn, so when designing contextual supports using the UDL 
framework, developers construct environments that offer multiple 
means and options to provide access to supports that should allow 
students to reappraise challenging tasks as demanding but doable. 

One major challenge that emerges from this approach is that 
students vary in the degree to which they make active and good 
choices about the supports they leverage, and in fact the literature 
suggests that students who would most benefit from embedded 
supports are often least likely to choose to access them (A. M. 
Ryan & Pintrich, 1997, 1998; A. M. Ryan, Pintrich, & Midgley, 
2001). For this reason, teachers play an important and specific role 
in the implementation of support-rich UDL environments. They 
are facilitators and mediators, helping students to learn how to best 
leverage the designed environment in the service of learning. They 
are keen observers and effective users of data to understand the 
strengths and weaknesses of their students and, when necessary, 
guide the relationship between the student and the learning envi- 
ronment. 

Our quantitative findings reflect these challenges. The guidance 
teachers offered students in leveraging the UDSN and the ap- 
proach to instruction teachers used clearly played a role in how 
students used the UDSN. In the context of UDL-designed envi- 
ronments, instruction that is (a) adaptable to student strengths and 
weaknesses as students change and (b) carefully planned but 
responsive to “in the moment” teaching opportunities will be more 
effective than one-size-fits-all, static approaches (Connor et al.,
2009). Additionally, the effective use of real-time opportunities 
depends on teachers’ levels of expertise with the pedagogical 
approach, the curriculum, the learning goals, and the students’ 
needs. Indeed, we found that students whose teachers had more 
experience using science notebooks in instruction tended to dem- 
onstrate higher frequency of process behaviors. 


Limitations 


With regard to the design of the research study, we were not able 
to consider what process behaviors were used in the control group. 
Without an equivalent of the electronic usage log, we could not 
determine how often, for example, students reviewed data when 
entering explanations or how often they revised previous notebook 
entries. Future work could incorporate this type of comparison. 

In addition, although the impact of the UDSN was positive in
this implementation, teachers generally used the UDSN much less 
frequently than intended. The web-based notebook was meant to
be a part of regular science activities—a place to record data
during observations, take notes during interactions with teachers, 
and work regularly throughout the curriculum. Instead, it often 
became a place for weekly entering of previously collected data 
and focused reflection. In part, this limitation reflects the state of 
broadband and computer access in public schools. 

Access to adequate hardware and the Internet is still a signifi- 
cant problem in American public schools. Estimates indicate that 
80% of public schools do not have Internet access adequate for 
their instructional needs, and school leadership at the elementary 
level often underestimates the degree to which teachers and stu- 
dents can use computers and especially the Internet as a part of the 
normal course of teaching and learning (Fox, Waters, Fletcher, & 
Levin, 2012). Studies like this one should help to raise awareness 
as to the utility of elementary level computer and Internet use, as 
well as the limitations of current infrastructure. 


Conclusion 


Interest in the UDL framework has increased exponentially over 
the last decade. The Higher Education Opportunity Act of 2008 
(HEOA, 2012, 20 U.S.C. § 1003(24)) established the statutory 
definition for UDL, strongly suggesting that preservice teacher 
training incorporate instruction on strategies consistent with UDL 
(HEOA, 2012, 20 U.S.C. § 1022d(b)(1)(K)), and the U.S. Depart- 
ment of Education’s National Educational Technology Plan 2010 
makes frequent reference to UDL as a framework that reduces 
barriers and maximizes learning opportunities for all students 
(U.S. Department of Education, Office of Educational Technol- 
ogy, 2010). This interest is understandable because the framework 
provides a possible answer to the growing call for more “person- 
alized” curricular materials that have the potential to accommodate 
the full diversity of learners and teachers within the education 
system, and, furthermore, the UDL framework explicitly reflects 
our best understanding of the learning sciences as it can be applied 
to education design. However, research exploring UDL as an 
approach to education design is still in its infancy. 

This study is one of only a handful of experimental studies 
looking at either the overall impact of a UDL technology in 
authentic classrooms or the active components of UDL technology 
in the process of development. To our knowledge, this is the only 
study to qualitatively explore students’ and teachers’ experiences 
and perceptions of usefulness of UDL technology to learning. 
There is substantial opportunity to explore, examine, and inform 
theory and research concerning the nature of learning and devel- 
opment from a practice-oriented perspective. Such work along 
with implementation and design-based research on UDL environ- 
ments is warranted. Defining a research agenda for UDL is beyond 
the scope of this article, but several other published articles make 
suggestions in this regard (e.g., Rappolt-Schlichtmann et al., 
2012). 

Digital technologies used in conjunction with strong teaching 
strategies offer unprecedented opportunities to support developing 
active science learning skills, but the technology-based format 
does not automatically improve on the print format. Many 
technology-based programs and digital materials are inaccessible, 
and the impact of existing technology-based programs, even when 
effective, is typically small (e.g., see “IES What Works Clear-
inghouse”; http://ies.ed.gov/ncee/wwc/reports/Topicarea.aspx?tid=15).
However, as this work demonstrates, when technology
is used to foster a supported learning environment in which the 
emphasis is on core learning activities, with strong teacher
experience and embedded support for construct-irrelevant skills and
strategies, technology can provide consistent gains for a variety of
learners.